# Plankton 2.0 Challenge Tutorial
![plankton](https://68.media.tumblr.com/avatar_5d78d8a8ae8e_128.png)

## Getting Started

Welcome to the Plankton 2.0 Challenge organized by SAP! In this notebook, we introduce the world of plankton and describe the datasets in some detail. We also provide a tutorial on classifying plankton images with a Random Forest classifier.

## Introduction

The SAP Next-Gen program is an innovation platform for the SAP ecosystem enabling companies, partners and universities to connect and innovate with purpose linked to the UN Global Goals. Reimagine the future of industries with exponential technologies. Seed in disruptive innovation with startups. Build skills for digital futures. Showcase thought leadership.

The program leverages a network of 3,300+ educational institutions in 111 countries that are members of the SAP University Alliances and SAP Young Thinkers programs.

In this challenge we will tackle UN Global Goal 14: **Life Below Water**. The aim of this goal is to conserve and sustainably use the oceans, seas, and marine resources for sustainable development.

The Oceanographic Observatory of Villefranche-sur-Mer was established in 1882 by Hermann Fol with the encouragement of Charles Darwin and has a worldwide reputation in plankton research. It is also one of the most “Data Science” aware geoscience research centres. The observatory has collected about 50 million images of plankton from all over the world, 20 million of which are already labelled and ready for use.

The purpose of the challenge is to create a prototype of a new application that shows a practical use of “plankton data” for the general public or for industry, using the attributes of plankton images and the ability of plankton to predict water/air quality.

The teams will be formed at the beginning of the challenge. They will consist of students from different schools who bring different skills (business, data science, coding), and they will be helped by SAP developers. The Design Thinking sessions will start the teams off on a path of new and inventive applications!

This tutorial is divided into 3 sections:
- A description of the role of plankton as a great bioindicator
- A description of the tools that we will be using (open source and SAP tools)
- A playground with the data to clean it and prepare it (Technical)


## 1- Challenge Context

### Bioindicators

#### Introduction

Naturally occurring Bioindicators are used to assess the health of the environment. They are also an important tool for detecting changes in the environment, either positive or negative, and their subsequent effects on human society. Certain factors govern the presence of Bioindicators in the environment, such as transmission of light, water, temperature, and suspended solids. Through the application of Bioindicators we can predict the natural state of a region or its level of contamination. The advantages of using Bioindicators are as follows:

- Biological impacts can be determined.
- Synergetic and antagonistic impacts of various pollutants on a creature can be monitored.
- Early-stage diagnosis is possible, and the harmful effects of toxins on plants and human beings can be monitored.
- They can be easily counted, due to their prevalence.
- They are an economically viable alternative to other specialized measuring systems.

#### Utilization of Bioindicators

The expression ‘Bioindicator’ is used as an aggregate term referring to all sources of biotic and abiotic reactions to ecological changes. Rather than simply working as gauges of natural change, taxa are used to show the impacts of changes in their natural surroundings, or of environmental change. They are used to detect changes in natural surroundings and to indicate negative or positive impacts. They can also detect changes caused by pollutants, which can affect the biodiversity of the environment as well as the species present in it.

The condition of the environment is effectively monitored by the use of Bioindicator species due to their resistance to ecological variability. Hasselbach et al. used the moss Hylocomium splendens as a natural indicator of heavy metals in the remote tundra environment of northwestern Alaska. Here, ore is mined at Red Dog Mine, the world’s largest producer of zinc (Zn), and is carried along a single road (∼75 km long) to storage facilities on the Chukchi Sea. Hasselbach and her partners examined whether this overland transport was affecting the surrounding biota. The heavy metal content of the moss tissue was analyzed at different distances from the road. The concentrations of metals in moss tissue were highest adjacent to the haul road and decreased with distance, supporting the theory that overland transport was in fact modifying the surrounding environment. In this study, lichens were utilized as biomonitors through the quantitative estimation of metal concentrations inside individual lichen.

Natural, biological, and biodiversity markers can be found in various organisms occupying different types of environments. Lichens (a symbiosis among cyanobacteria, algae, and/or fungi) and Bryophytes (liverworts) are frequently used to monitor air contamination. Both Lichens and Bryophytes are powerful Bioindicators of air quality because they have no roots, no cuticle, and acquire all their nutrients from direct exposure to the atmosphere. Their high surface-area-to-volume ratio further supports their use as bioindicators, since it helps them capture contaminants from the air. Cyanophyta, a type of phytoplankton, is one particularly powerful bioindicator, known to indicate rapid eutrophication of water bodies such as reservoirs and lakes through the formation of blooms.

#### Biomonitoring

Living organisms are used to characterize a biosphere. These organisms are known as Bioindicators or biomonitors, and the two differ considerably. Bioindicators reveal the quality of the changes taking place in the environment, while biomonitors provide quantitative information on the quality of the environment. Biological monitoring also incorporates data regarding past disturbances and the impacts of various variables.

Monitoring can be done for various biological processes or systems with the objective of observing temporal and spatial changes in health status, assessing the impacts of specific environmental or anthropogenic stressors, and assessing the viability of anthropogenic measures (e.g. reclamation, remediation, and reintroduction). Species diversity is a prime aspect of biological monitoring and is considered a valuable parameter in determining the health of the environment.

Biomonitoring is one of the essential components for assessing the quality of water and has become an integral element of studies on water pollution. Biomonitors are freely available all around the world. They fundamentally mirror the impact of the environment on living creatures and can be used and understood with minimal preparation and training. Although all natural species can be considered biomonitors to some degree, the above advantages apply particularly well to planktons and similar species when water pollution is considered.
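
Species diversity can be summarized numerically in several ways. As one illustration (an assumption of this tutorial, not a method prescribed by the studies cited here), the sketch below computes the Shannon diversity index, H = -Σ p_i ln p_i, from per-taxon counts. The taxon names and counts are made up for the example.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with a non-zero count."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("need at least one counted organism")
    h = 0.0
    for n in counts.values():
        if n > 0:  # taxa that were not observed contribute nothing
            p = n / total
            h -= p * math.log(p)
    return h


# Hypothetical samples: same number of taxa, different evenness.
even_sample = {"Copepoda": 50, "Cladocera": 50}
skewed_sample = {"Copepoda": 99, "Cladocera": 1}

print(shannon_index(even_sample))    # two equally common taxa give ln(2)
print(shannon_index(skewed_sample))  # a dominated community scores lower
```

A community dominated by a single taxon scores close to zero, while an evenly spread community scores higher, which is one way a drop in diversity could be tracked over time.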



### Plankton
![whale food](whale_food.png)

#### Introduction

In many water bodies, such as seas, lakes, streams, and swamps, significant biological production is carried out by plankton. Planktons comprise organisms with chlorophyll (i.e. phytoplankton) and animals (i.e. zooplankton). These communities float along with currents and tides, yet they fix and cycle important quantities of energy that is then passed on to higher trophic levels (Walsh 1978).

Planktons react rapidly to ecological changes and are viewed as excellent indicators of water quality and trophic conditions due to their short life span and rapid rate of reproduction. Under natural conditions, the occurrence of planktonic organisms is related to their tolerance range for abiotic ecological factors (temperature, oxygen concentration, and pH) as well as to the biotic interactions among organisms. The changes that occur within plankton communities provide a basis for determining the trophic state of water bodies.

![need](https://kaggle2.blob.core.windows.net/competitions/kaggle/3978/media/Plankton-Diagram3-lg.png)

#### Planktons as an indicator of water pollution

Since planktons are profoundly sensitive to environmental change, they are among the best markers of water quality and particularly of sea conditions. One of the reasons planktons are studied is to monitor the water quality of a water body when there are high concentrations of phosphorus and nitrogen; these concentrations may be indicated by certain planktons reproducing at an increased rate. This is evidence of poor water quality that may affect other organisms living in the water body. In addition to being a health indicator, planktons are also the fundamental food source for many larger organisms in the sea. Plankton is therefore key to marine organisms, both as an indicator of water quality and as the main food source for many fish.

Plankton also plays an important role in the biological degradation of organic matter, but if plankton populations are too large this creates other problems in managing the water body. Fish play an important role at this critical stage of the ecological process by grazing on the planktons. The two roles played by fish are crucial: they help maintain the proper balance of planktons in the pond, and they convert the nutrients available in wastewater into a form that is consumable by humans. Additionally, certain planktons such as cyanobacteria produce toxins that are harmful to fish growth. Thus planktons can be either useful or harmful with respect to wastewater-fed fish production.

![indicator](http://www.tandfonline.com/na101/home/literatum/publisher/tandf/journals/content/tfls20/2016/tfls20.v009.i02/21553769.2016.1162753/20160621/images/medium/tfls_a_1162753_f0004_b.gif)

#### Phytoplankton

Phytoplanktons, also known as microalgae, are similar to terrestrial plants in that they contain chlorophyll and require sunlight to live and grow. Most are buoyant and float in the upper part of the sea, where light penetrates the water. Growth and photosynthesis are closely related, each being a function of the use of light and nutrients. Algae are quite sensitive to contamination, and this may be reflected in their population levels and/or rates of photosynthesis. For the most part, algae are as sensitive to contaminants as other species. A change in the diversity of phytoplankton species may indicate pollution of the marine ecosystem.

##### Evidences pertaining to phytoplankton

Phytoplanktons have been used for successful observation of water contamination and are a useful indicator of water quality.
In 1975, Dugdale described the relationship between the growth rate of an algal population, photosynthesis, and nutrient concentration in the water body. Contaminants can influence the connection between growth rate and each of these variables. For example, if an industrial effluent is colored or contains suspended solids, light may be filtered or absorbed, causing a reduction in growth rate. MacIsaac and Dugdale showed in 1976 that a decrease in light leads to a decrease in the rate of uptake of ammonia and nitrate in marine phytoplankton.

Overnell et al. demonstrated that light-induced oxygen evolution from the freshwater species Chlamydomonas reinhardtii was sensitive to cadmium, methyl mercury, and lead. Moore et al. discovered that organochlorine compounds decrease the use of bicarbonate by estuarine phytoplankton. Whitacre et al. also produced significant research on the effect of numerous chlorinated hydrocarbons on carbon fixation by phytoplankton.

Phytoplanktons are also an important route of pollutant transfer from water to upper trophic levels and even to humans. Algae are unable to decompose pesticides and thus pass them on to the herbivores that feed on them. The accumulation and uptake of substances plays an important role in the pollution dynamics of phytoplankton. As indicated by MacIsaac and Dugdale, obstructed light hampers the uptake of ammonia and nitrate by aquatic phytoplankton, especially when colored or suspended solid industrial waste accumulates on the water surface and filters or absorbs light, which reduces the growth rate.


| Names of phytoplankton                          | Indications                                                       |
| :---------------------------------------------: | :---------------------------------------------------------------: |
| Green algae                                     | Facilitates the growth of fishes                                  |
| Mosses, liverworts                              | Pollution by accumulation of metals                               |
| Charophytes                                     | Quality of water                                                  |
| Selenastrum                                     | Water pollution                                                   |
| Wolffia globosa                                 | Contamination of cadmium                                          |
| Euglena gracilis                                | Organic pollution in lakes                                        |
| Chlorella vulgaris                              | Helps in removal of heavy metal contamination from water and soil |
| Chlorococcales like C. vulgaris and A. falcatus | Indicators of paper industry and sewage waste                     |




#### Zooplanktons

Zooplanktons are microscopic animals living near the surface of the water body. They are poor swimmers and rely instead on tides and currents for transport. They feed upon phytoplanktons, bacterioplanktons, or detritus (i.e. marine snow). Zooplanktons constitute a vital food source for fish. They also play an important role as Bioindicators and help to evaluate the level of water pollution. In freshwater communities, along with fish, they are the main food source for many other aquatic species.

They are considered to play a vital part in indicating the water quality, eutrophication, and productivity of a freshwater body. To determine the status of a freshwater body it is necessary to measure the seasonal variation and presence of zooplanktons.

The variety of species, biomass diversity, and abundance of zooplankton groups can be used to determine the health of a biological system. The potential of zooplankton as a bioindicator is high because their growth and distribution depend on abiotic parameters (e.g. temperature, salinity, stratification, and pollutants) and biotic parameters (e.g. limitation of food, predation, and competition).

##### Evidences pertaining to zooplanktons

Industrial acidification caused a reduction in the number of species and changes in species dominance, both of which were affected as pH decreased from 7.0 to 3.8. Jha and Barat carried out research on zooplankton in Lake Mirik, in Darjeeling, Himalayas. This lake was polluted by toxins entering from outside sources, resulting in a decreased pH and an increased acidity level. This was confirmed by the investigation of other physicochemical parameters and planktons. Under these conditions, cladocerans (Bosmina, Moina, and Daphnia) and copepods (Phyllodiaptomus and Cyclops, the most abundant copepods) were found. The study concluded that the lake cannot be used as a source for the supply of drinking water, and these organisms served as bioindicators to assess the health of this water body. According to Siddiqi and Chandrasekhar, Trichotria tetractis could be used as a contamination indicator, as it was seen in a lake rich in phosphorus and heavy metal ions. This species had previously been found in sewage-contaminated tanks. Phosphorus and metal ions, as well as high total alkalinity, hardness, and high conductivity (130 mS m−1) of the lake water, were limiting factors for the development of zooplankton.

Zooplankton may be present in a wide variety of ecological conditions, yet dissolved oxygen, temperature, salinity, pH, and other physicochemical parameters are limiting factors. The presence of three species of Brachionus indicates that a lake is being eutrophicated and is organically contaminated. Copepod populations vary seasonally in water bodies in different parts of India; seasonal studies showed that zooplankton density was highest in the rainy season and decreased in summer due to high temperatures. Copepods form the dominant group of all the zooplanktons, followed by Cladocera, rotifers, and Ostracoda. Ultimately, zooplankton has been found to be an excellent Bioindicator for evaluating the contamination of any oceanic body (saltwater).

| Names of zooplanktons                                | Indications                                                           |
| :--------------------------------------------------: | :-------------------------------------------------------------------: |
| Rotifers                                             | Trophic status                                                        |
| Keratella tropica, Hexarthra mira                    | High turbidity due to suspended sediments                             |
| Brachionus calyciflorus                              | Eutrophic conditions and organic pollution of lakes                   |
| Cladocerans group (unspecified)                      | Low concentration of contaminants                                     |
| Trichotria tetractis                                 | Pollution caused by accumulation of phosphorus and heavy metal ions   |
| Thermocyclops, Argyrodiaptomus                       | Eutrophic conditions                                                  |
| B. angularis, Rotatoria                              | Eutrophic conditions                                                  |
| Leeches                                              | Contamination indicated by the presence of PCB (polychlorinated biphenyl) in a river |
| Leeches                                              | Sensor-bioindicator of river contamination by PCBs                    |
| Oyster (Crassostrea gigas), crabs (Geoticadepressa)  | Presence of lead                                                      |
| B. dolabrotus                                        | High turbidity due to suspended sediments                             |
| Copepods (Cyclops & Phyllodiaptomus)                 | Health of the marine body                                             |
| Cladocerans (Moina, Daphnia, Bosmina)                | Health of the marine body                                             |


![Karen](https://vignette.wikia.nocookie.net/spongebob/images/c/c0/Plankton%27s_Diary_Karen.jpg/revision/latest/scale-to-width-down/320?cb=20150701035248)

## 2- Tools

### Open Source

You are free to use any programming language and any open source library. If you would like to play with the data, we recommend Python and the following libraries:

- pandas
- NumPy
- scikit-learn


### SAP


#### Data Analysis and Prediction: SAP Predictive Analytics

SAP Predictive Analytics is a powerful tool from SAP that has many functional and technical capabilities:

##### Functional Capabilities:

- Automated analytics
    - Let business analysts and data scientists use automated technique to build sophisticated predictive models that can be embedded in business processes – in days, not weeks or months.
- Expert analytics
    - Provide a modeling environment for open-source R-based algorithms, SAP HANA PAL, and SAP Automated Predictive Library (APL). Build predictive models with a powerful drag-and-drop interface, and allow users to use their own R scripts.
- Model management
    - Provide end-to-end model management, maintain peak performance for thousands of predictive models, and schedule updates as needed.
- Data Manager
    - Data Manager provides a framework to facilitate automated data preparation. Users can define a broad set of reusable components, which can be applied to automatically create modeling data sets.
- Predictive scoring
    - Get individual variable contributions for every predictive model. Simulate and score a specific business question in real time. Generate predictive scoring for a wide variety of target systems and directly embed the results.
- Social and recommendation
    - Run powerful network and link analysis to understand the connections and relationships between your customers – and discover which customers have a strong social influence. This capability can help you better manage churn, risk, and fraud.
- Advanced Visualization
    - Advanced Visualization provides an intuitive way to explore your data. Transform the results of applied predictive modeling into stunning, advanced visualizations that reveal actionable insights.
    
##### Technical Capabilities:

- Predictive automation
    - Support both SAP and non-SAP environments.
- Big Data analytics
    - Automate the building of predictive models based on data stored in Hadoop. Do data manipulation, model training, and retraining directly on Hadoop data using the Spark engine.
- Automated predictive library
    - Use the application’s automated analytics engine and data mining capabilities on your data stored in SAP HANA.
- R integration
    - Leverage tight integration with R to enable a large number of algorithms and custom R scripts for analyzing your data.

#### User Interfaces and Rapid Prototyping: SAP BUILD platform
SAP Cloud Platform [BUILD](https://www.build.me/splashapp/) allows you to collaboratively develop prototypes with your project team, engage end-users for feedback, or jumpstart your designs with one of many prototype examples from the gallery, all while learning the design process. Discover how BUILD’s comprehensive tools are the next frontier in creating design-led applications.
BUILD is a web application that makes creating and designing user interfaces easy: just drag and drop the components and BUILD will handle source code generation.



```python
from IPython.display import HTML

# Embed the "Introducing BUILD" video hosted on Vimeo
HTML('<iframe src="https://player.vimeo.com/video/182832619?title=0&byline=0&portrait=0" width="700" height="394" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe><p><a href="https://vimeo.com/182832619">Introducing BUILD</a>.</p>')
```


## 3- Data sets
### Description

In the context of the hackathon, we have prepared the following data sets:
- Villefranche-sur-Mer hydrology since 2004 (chlorophyll level, temperature, depth, oxygen level, etc.)
- Plankton biomass abundance in Villefranche-sur-Mer
- Maritime traffic in the Mediterranean Sea
- Overfishing in the Mediterranean Sea
- Annotated plankton images
- Annotated plankton drawings
- Climate change indicators:
    - Average Global Sea Surface Temperature from 1880 until 2015
    - Global Average Absolute Sea Level Change from 1880 until 2015
    
#### Hydrology measurements in Villefranche sur Mer

Hydrology is the scientific study of the movement, distribution, and quality of water on Earth and other planets, including the water cycle, water resources and environmental watershed sustainability. A practitioner of hydrology is a hydrologist, working within the fields of earth or environmental science, physical geography, geology or civil and environmental engineering. Using various analytical methods and scientific techniques, they collect and analyze data to help solve water related problems such as environmental preservation, natural disasters, and water management.

Hydrology subdivides into surface water hydrology, groundwater hydrology (hydrogeology), and marine hydrology. Domains of hydrology include hydrometeorology, surface hydrology, hydrogeology, drainage-basin management and water quality, where water plays the central role.

The file **hydro_vlfr.csv** contains the hydrology measurements of Villefranche-sur-Mer as shown at http://www.obs-vlfr.fr/data/view/radehydro/std/. These data correspond to depths between 1 and 50 m.

## TODO: explain interactions between hydrology variables (oxygen, nitrate, chlorophyll ..etc)


##### Featureset Exploration
* **date**
* **depth**
* **Chlorophyll a [µg/L]**
* **Nitrate [µmol/L]**
* **Oxygen (Winkler) [mL/L]** 
* **Particulate organic carbon [µg/L]** 
* **Particulate organic nitrogen [µg/L]** 
* **Phaeopigments [µg/L]** 
* **Phosphate [µmol/L]** 
* **Pot. density [kg/m3 - 1000]** 
* **Salinity** 
* **Silicate [µmol/L]** 
* **Temperature [ºC]**
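
As a first look at these measurements, the sketch below averages one variable per depth using only the Python standard library. It is a minimal sketch under stated assumptions: the sample rows are invented, and the separator, date format, and handling of missing values in the real **hydro_vlfr.csv** may differ. Only the column names come from the list above.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented sample rows; the real hydro_vlfr.csv layout is an assumption here.
SAMPLE = """date,depth,Chlorophyll a [µg/L],Temperature [ºC]
2004-01-06,1,0.31,14.2
2004-01-06,50,0.12,13.9
2004-01-13,1,0.35,14.0
2004-01-13,50,,13.8
"""


def mean_by_depth(handle, column):
    """Average `column` for each depth, skipping empty cells."""
    values = defaultdict(list)
    for row in csv.DictReader(handle):
        cell = row[column].strip()
        if cell:  # missing measurements are left out of the average
            values[int(row["depth"])].append(float(cell))
    return {depth: mean(v) for depth, v in sorted(values.items())}


chlorophyll = mean_by_depth(io.StringIO(SAMPLE), "Chlorophyll a [µg/L]")
print(chlorophyll)
```

To run it on the real file, pass an open file handle instead of the `io.StringIO` wrapper. With pandas, `read_csv` followed by a `groupby` on the depth column would give the same kind of summary.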


#### Biomass Abundance in Villefranche sur Mer

The file **WP2_vlfr.csv** describes the abundance of plankton in Villefranche-sur-Mer. For each taxonomic group, there are 3 variables:
* **ESD**: Equivalent Spherical Diameter, the average size of organisms in a week.
* **concentration**: number of organisms per m3
* **biovolume**: volume of living organisms per m3

The data cover groups and all their sub-groups. For example, "living" represents the average size, the concentration and the volume of all living organisms during the week of observation.
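
To make the notion of an equivalent spherical diameter concrete, the sketch below shows the geometric relations between a sphere's diameter, its volume, and the area of its circular outline. How the ESD values in the file were actually derived (from image area or from volume) is not stated in this tutorial, so treat this purely as an illustration of the geometry. The function names are hypothetical.

```python
import math


def sphere_volume(esd):
    """Volume of a sphere with diameter `esd`: V = pi / 6 * d^3."""
    return math.pi / 6.0 * esd ** 3


def esd_from_volume(volume):
    """Diameter of the sphere that has the given volume."""
    return (6.0 * volume / math.pi) ** (1.0 / 3.0)


def esd_from_area(area):
    """Diameter of the circle that has the given projected area."""
    return 2.0 * math.sqrt(area / math.pi)


# Round trip: diameter -> volume -> diameter.
print(esd_from_volume(sphere_volume(1.2)))
# A projected area of pi corresponds to a circle of radius 1, so diameter 2.
print(esd_from_area(math.pi))
```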


##### Featureset Exploration
* **date**: observation date
* **id**: observation id
* **esd**: equivalent spherical diameter, the average size of the organisms in a week
* **concentration**: number of organisms of this taxonomic group per m3
* **biovolume**: volume of living organisms of this taxonomic group per m3
* **name**: name of the taxonomic group
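
Because the file contains both groups and their sub-groups, summing concentrations over all rows would count some organisms more than once. The sketch below finds the most concentrated group per date while skipping the aggregate "living" row. It assumes one row per group and date with the columns listed above; the sample values and the two taxon names used are only for illustration.

```python
import csv
import io

# Invented rows in the assumed long format of WP2_vlfr.csv.
SAMPLE = """date,id,esd,concentration,biovolume,name
2010-01-04,1,0.9,120.0,35.0,living
2010-01-04,2,0.8,80.0,20.0,Copepoda
2010-01-04,3,1.4,15.0,9.0,Chaetognatha
2010-01-11,4,0.9,95.0,30.0,living
2010-01-11,5,0.7,40.0,8.0,Copepoda
2010-01-11,6,1.5,50.0,18.0,Chaetognatha
"""


def dominant_group(handle, exclude=("living",)):
    """Return {date: name of the group with the highest concentration}."""
    best = {}  # date -> (concentration, name)
    for row in csv.DictReader(handle):
        if row["name"] in exclude:  # skip aggregates that contain other groups
            continue
        concentration = float(row["concentration"])
        if row["date"] not in best or concentration > best[row["date"]][0]:
            best[row["date"]] = (concentration, row["name"])
    return {date: name for date, (_, name) in best.items()}


dominant = dominant_group(io.StringIO(SAMPLE))
print(dominant)
```

The `exclude` tuple should be extended with any other parent groups present in the real file, so that only groups at the same level are compared.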

#### Maritime Traffic in Mediterranean 

On average, 150 000 ships are in transit around the world each day.

#### Overfishing in Mediterranean

## TODO include cleaned datasets
http://www.fao.org/gfcm/data/capture-production-statistics/fr/
https://www.grida.no/resources/5937


#### Annotated plankton images
For this competition, Hatfield scientists have prepared a large collection of labeled images, approximately 30k of which are provided as a training set. Each raw image was run through an automatic process to extract regions of interest, resulting in smaller images that contain a single organism/entity. You must create an algorithm that assigns class probabilities to a given image. Several characteristics of this problem make this classification difficult:

- There are many different species, ranging from the smallest single-celled protists to copepods, larval fish, and larger jellies.
- Representatives from each taxon can have any orientation within 3-D space.
- The ocean is replete with detritus (often decomposing plant or animal matter that scientists like to call “whale snot”) and fecal pellets that have no taxonomic identification but are important in other marine processes.
- Some images are so noisy or ambiguous that experts have a difficult time labeling them. Some amount of noise in the ground truth is thus inevitable.
- The presence of "unknown" classes requires models to handle the special cases of unidentifiable objects.

![plankton](https://kaggle2.blob.core.windows.net/competitions/kaggle/3978/media/plankton%20schmorgasborg.jpg)

The file **train.zip** contains labeled images for training. Each folder represents a class, and the images within a folder belong to that class. Classes were chosen by the Hatfield experts and represent scientifically meaningful groupings of organisms and objects (see the below FAQs). The file also contains a special folder, **info**, with TSV files that hold contextual information about each of the images. The featureset of the *.tsv files is:

* **object_id**
* **object_lat**
* **object_lon**
* **object_date**
* **object_time**
* **object_link**
* **object_depth_min**
* **object_depth_max**
* **object_annotation_category**
* **object_annotation_status**
* **object_annotation_parent_category**
* **object_annotation_hierarchy**
* **object_annotation_person_name**
* **object_annotation_person_email**
* **object_annotation_date**
* **object_annotation_time**
* **img_file_name**
* **object_area**
* **object_slope**
* **object_feret**
* **object_compm3**
* **object_symetrievc**
* **object_ystart**
* **object_min**
* **object_minor**
* **object_mean**
* **object_ym**
* **object_circ.**
* **object_compslope**
* **object_y**
* **object_by**
* **object_major**
* **object_xstart**
* **object_thickr**
* **object_perimmajor**
* **object_compm1**
* **object_intden**
* **object_xmg5**
* **object_lon_end**
* **object_width**
* **object_elongation**
* **object_stddev**
* **object__area**
* **object_nb1**
* **object_tag**
* **object_kurt**
* **object_x**
* **object_compmean**
* **object_feretareaexc**
* **object_fractal**
* **object_mode**
* **object_height**
* **object_convarea**
* **object_convperim**
* **object_perimareaexc**
* **object_sr**
* **object_xm**
* **object_centroids**
* **object_perimferet**
* **object_perim.**
* **object_angle**
* **object_nb2**
* **object_symetrieh**
* **object_histcum1**
* **object_compentropy**
* **object_symetriehc**
* **object_bx**
* **object_fcons**
* **object_skew**
* **object_meanpos**
* **object_nb3**
* **object_cv**
* **object_median**
* **object_range**
* **object_circex**
* **object_skelarea**
* **object_histcum3**
* **object_area_exc**
* **object_compm2**
* **object_ymg5**
* **object_cdexc**
* **object_lat_end**
* **object_esd**
* **object_histcum2**
* **object_max**
* **object_symetriev**
* **sample_id**
* **sample_dataportal_descriptor**
* **sample_cable_length**
* **sample_ship**
* **sample_tot_vol**
* **sample_cable_angle**
* **sample_other_ref**
* **sample_ship_speed**
* **sample_duration**
* **sample_net_mesh**
* **sample_barcode**
* **sample_ctdrosettefilename**
* **sample_zmax**
* **sample_bottomdepth**
* **sample_nb_jar**
* **sample_stationid**
* **sample_cable_speed**
* **sample_tow_type**
* **sample_program**
* **sample_tot_vol_qc**
* **sample_scan_operator**
* **sample_zmin**
* **sample_net_surf**
* **sample_net_type**
* **sample_tow_nb**
* **sample_comment**
* **sample_depth_qc**
* **sample_sample_qc**
* **sample_open**
* **process_id**
* **process_particle_sep_mask**
* **process_particle_threshold**
* **process_img_background_img**
* **process_time**
* **process_software**
* **process_img_software_version**
* **process_date**
* **process_particle_bw_ratio**
* **process_particle_max_size_mm**
* **process_particle_version**
* **process_particle_pixel_size_mm**
* **process_particle_min_size_mm**
* **process_img_od_std**
* **process_img_resolution**
* **process_img_od_grey**
* **acq_id**
* **acq_instrument**
* **acq_lut_min**
* **acq_max_mesh**
* **acq_rotation**
* **acq_lut_max**
* **acq_author**
* **acq_lut_filter**
* **acq_lut_color_balance**
* **acq_quality**
* **acq_yoffset**
* **acq_software**
* **acq_greyfrom**
* **acq_hardware**
* **acq_lut_ratio**
* **acq_sub_method**
* **acq_lut_odrange**
* **acq_scan_date**
* **acq_sub_part**
* **acq_xsize**
* **acq_min_mesh**
* **acq_scan_resolution**
* **acq_bitpixel**
* **acq_scan_time**
* **acq_imgtype**
* **acq_miror**
* **acq_ysize**
* **acq_xoffset**
* **acq_lut_16b_median**
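
The TSV files can be read with the standard `csv` module by setting the delimiter to a tab. The sketch below counts how many images carry each annotation category. It uses three of the columns listed above on invented rows. The exact layout of the real files (extra header rows, encoding, quoting) is an assumption and should be checked before relying on it.

```python
import csv
import io
from collections import Counter

# Invented rows; only the column names come from the list above.
SAMPLE_TSV = (
    "img_file_name\tobject_annotation_category\tobject_esd\n"
    "0001.jpg\tCalanoida\t0.82\n"
    "0002.jpg\tdetritus\t0.40\n"
    "0003.jpg\tCalanoida\t0.91\n"
)


def count_categories(handle):
    """Count rows per object_annotation_category in a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["object_annotation_category"] for row in reader)


category_counts = count_categories(io.StringIO(SAMPLE_TSV))
print(category_counts.most_common())
```

Counting categories this way gives a quick view of class imbalance, which matters when training a classifier on the images.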


The list of classes included is the following:

```
Abylopsis tetragona_eudoxie
Abylopsis tetragona_gonophore
Abylopsis tetragona_nectophore
Actinopterygii_egg
Agalma elegans_siphonula
Annelida_larvae
Annelida_part
Appendicularia_head
Appendicularia_Oikopleuridae
Appendicularia_tail
artefact_badfocus
artefact_bubble
Augaptilidae_Haloptilus
Brachyura_like
Brachyura_megalopa
Bryozoa_cyphonaute
Calanoida_Acartiidae
Calanoida_Calanidae
Calanoida_Candaciidae
Calanoida_Centropagidae
Calanoida_Temoridae
Campanulariidae_Obelia
Cavolinia inflexa_egg
Cavolinia_Cavolinia inflexa
Ceratiaceae_Neoceratium
Chaetognatha_head
Chaetognatha_tail
Cirripedia_nauplii
Clio_Clio pyramidata
Cnidaria_Hydrozoa
Cnidaria_part
Copepoda_Calanoida
Copepoda_dead
Copepoda_Harpacticoida
Copepoda_multiple
Copepoda_Poecilostomatoida
Creseis_Creseis acicula
Crustacea_nauplii
Crustacea_part
Cyclopoida_Oithonidae
Decapoda_zoea
detritus_fiber
Diphyidae_eudoxie
Diphyidae_gonophore
Diphyidae_nectophore
Eukaryota_Harosa
Eumalacostraca_Amphipoda
Euphausiacea_calyptopsis
Flaccisagitta_Flaccisagitta enflata
Fritillariidae_Fritillaria
Galatheidae_zoea
Gnathostomata_Actinopterygii
Harpacticoida_Euterpina
Hexacorallia_Actiniaria
Hippopodiidae_nectophore
Maxillopoda_Copepoda
Metazoa_Annelida
Metazoa_Chaetognatha
Metazoa_Echinodermata
Metazoa_Mollusca
Metridinidae_Pleuromamma
Mollusca_Bivalvia
not-living_artefact
not-living_detritus
Oikopleuridae_Oikopleura
Oligostraca_Ostracoda
other_egg
other_multiple
other_othertocheck
other_seaweed
Penaeidae_protozoea
Penilia_Penilia avirostris
Physonectae_nectophore
Podonidae_Evadne
Podonidae_Podon
Poecilostomatoida_Corycaeidae
Poecilostomatoida_Oncaeidae
Poecilostomatoida_Sapphirinidae
Rhopalonematidae_Aglaura
Rhopalonematidae_Rhopalonema
Salpida_colony
Salpida_larvae
Scyphozoa_ephyra
Siphonophorae_Physonectae
Siphonophorae_siphosome
Styliola_Styliola subula
Thaliacea_Doliolida
Thaliacea_Pyrosomatida
Thaliacea_Salpida
Thecofilosea_Phaeodaria
Thecosomata_Creseidae
Thecosomata_Limacinidae
Thecostraca_Cirripedia
Tunicata_Appendicularia

```
The file **test.zip** contains the test set images, for which you must predict class probabilities. 

**plankton_identification.pdf** is provided as a rough guide to understand relationships between the classes. The tree-like diagram indicates morphological and biological connections between groups. Dashed lines indicate a weak(er) relationship and solid lines a stronger relationship.

#### Annotated plankton drawings

The file **drawing-train.zip** contains the training set for the drawings. Each folder represents a class, and the images within a folder belong to that class. Classes represent scientifically meaningful groupings of organisms and objects (see the FAQs below).

The file **drawing-test.zip** contains the test set drawings.

#### Average Global Sea Surface Temperature
Sea surface temperature—the temperature of the water at the ocean surface—is an important physical attribute of the world’s oceans. The surface temperature of the world’s oceans varies mainly with latitude, with the warmest waters generally near the equator and the coldest waters in the Arctic and Antarctic regions. As the oceans absorb more heat, sea surface temperature increases and the ocean circulation patterns that transport warm and cold water around the globe change.

Changes in sea surface temperature can alter marine ecosystems in several ways. For example, variations in ocean temperature can affect what species of plants, animals, and microbes are present in a location, alter migration and breeding patterns, threaten sensitive ocean life such as corals, and change the frequency and intensity of harmful algal blooms such as “red tide.” Over the long term, increases in sea surface temperature could also reduce the circulation patterns that bring nutrients from the deep sea to surface waters. Changes in reef habitat and nutrient supply could dramatically alter ocean ecosystems and lead to declines in fish populations, which in turn could affect people who depend on fishing for food or jobs.

Because the oceans continuously interact with the atmosphere, sea surface temperature can also have profound effects on global climate. Increases in sea surface temperature have led to an increase in the amount of atmospheric water vapor over the oceans. This water vapor feeds weather systems that produce precipitation, increasing the risk of heavy rain and snow (see the Heavy Precipitation and Tropical Cyclone Activity indicators). Changes in sea surface temperature can shift storm tracks, potentially contributing to droughts in some areas. Increases in sea surface temperature are also expected to lengthen the growth season for certain bacteria that can contaminate seafood and cause foodborne illnesses, thereby increasing the risk of health effects.

##### Featureset Exploration
* **Year**
* **Annual anomaly**
* **Lower 95% confidence interval**
* **Upper 95% confidence interval**

#### Global Average Absolute Sea Level Change
As the temperature of the Earth changes, so does sea level. Temperature and sea level are linked for two main reasons:

* Changes in the volume of water and ice on land (namely glaciers and ice sheets) can increase or decrease the volume of water in the ocean (see the Glaciers indicator).
* As water warms, it expands slightly, an effect that is cumulative over the entire depth of the oceans (see the Ocean Heat indicator).

Changing sea levels can affect human activities in coastal areas. Rising sea level inundates low-lying wetlands and dry land, erodes shorelines, contributes to coastal flooding, and increases the flow of salt water into estuaries and nearby groundwater aquifers. Higher sea level also makes coastal infrastructure more vulnerable to damage from storms.

The sea level changes that affect coastal systems involve more than just expanding oceans, however, because the Earth’s continents can also rise and fall relative to the oceans. Land can rise through processes such as sediment accumulation (the process that built the Mississippi River delta) and geological uplift (for example, as glaciers melt and the land below is no longer weighed down by heavy ice). In other areas, land can sink because of erosion, sediment compaction, natural subsidence (sinking due to geologic changes), groundwater withdrawal, or engineering projects that prevent rivers from naturally depositing sediments along their banks. Changes in ocean currents such as the Gulf Stream can also affect sea levels by pushing more water against some coastlines and pulling it away from others, raising or lowering sea levels accordingly.

Scientists account for these types of changes by measuring sea level change in two different ways. Relative sea level change refers to how the height of the ocean rises or falls relative to the land at a particular location. In contrast, absolute sea level change refers to the height of the ocean surface above the center of the earth, without regard to whether nearby land is rising or falling.


Absolute sea level trends were provided by Australia’s Commonwealth Scientific and Industrial Research Organisation and the National Oceanic and Atmospheric Administration. These data are based on measurements collected by satellites and tide gauges. Relative sea level data are available from the National Oceanic and Atmospheric Administration, which publishes an interactive online map (http://tidesandcurrents.noaa.gov/sltrends/sltrends.shtml) with links to detailed data for each tide gauge.


##### Featureset Exploration
* **Year**
* **CSIRO - Adjusted sea level (inches)**
* **CSIRO - Lower error bound (inches)**
* **CSIRO - Upper error bound (inches)**
* **NOAA - Adjusted sea level (inches)**
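
As a starting point for exploring this table, the sketch below converts the sea level columns from inches to millimetres and derives the width of the CSIRO error band and the difference between the two sources. It assumes the column headers of the data file match the feature names listed above exactly; the function name `summarize_sea_level` and the demonstration rows are hypothetical placeholders, not real measurements.

```python
import pandas as pd

INCHES_TO_MM = 25.4

def summarize_sea_level(df):
    """Convert sea level columns to millimetres and derive two helper columns."""
    out = pd.DataFrame({"Year": df["Year"]})
    out["csiro_mm"] = df["CSIRO - Adjusted sea level (inches)"] * INCHES_TO_MM
    # Width of the CSIRO uncertainty band (upper bound minus lower bound)
    out["csiro_band_mm"] = (df["CSIRO - Upper error bound (inches)"]
                            - df["CSIRO - Lower error bound (inches)"]) * INCHES_TO_MM
    # Difference between the two sources; NaN where either value is missing
    out["noaa_minus_csiro_mm"] = (df["NOAA - Adjusted sea level (inches)"]
                                  - df["CSIRO - Adjusted sea level (inches)"]) * INCHES_TO_MM
    return out

# Placeholder rows for illustration only, NOT real measurements
demo = pd.DataFrame({
    "Year": [2000, 2001],
    "CSIRO - Adjusted sea level (inches)": [1.0, 2.0],
    "CSIRO - Lower error bound (inches)": [0.5, 1.5],
    "CSIRO - Upper error bound (inches)": [1.5, 2.5],
    "NOAA - Adjusted sea level (inches)": [1.0, float("nan")],
})
summary = summarize_sea_level(demo)
print(summary)
```

With the real file loaded into a DataFrame, the same function could feed a plot of `csiro_mm` against `Year`.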



## 4- Tutorial


In this tutorial, we will go step-by-step through a simple model to distinguish different types of plankton and demonstrate some tools for exploring the image dataset. We will start with a single example image to show how you could develop a metric based on the shape of the object within the image. First, we import the necessary modules from scikit-image, matplotlib, scikit-learn, and numpy. If you don't currently have Python installed, you can get the Anaconda distribution, which includes all of the packages referenced below.

In [None]:
#Import libraries for doing image analysis
from skimage.io import imread
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier as RF
import glob
import os
from sklearn import cross_validation
from sklearn.cross_validation import StratifiedKFold as KFold
from sklearn.metrics import classification_report
from matplotlib import pyplot as plt
from matplotlib import colors
from pylab import cm
from skimage import segmentation
from skimage.morphology import watershed
from skimage import measure
from skimage import morphology
import numpy as np
import pandas as pd
from scipy import ndimage
from skimage.feature import peak_local_max

# make graphics inline
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings("ignore")

### Importing the Data

The training data is organized in a series of subdirectories that contain examples for each class of interest. We will store the list of directory names to aid in labelling the data classes for training and testing purposes.

In [None]:
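# get the class names from the directory structure:
# every entry under train/ that does not look like a file (no "." in its name)
all_entries = set(glob.glob(os.path.join("competition_data", "train", "*")))
file_entries = set(glob.glob(os.path.join("competition_data", "train", "*.*")))
directory_names = list(all_entries.difference(file_entries))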
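
Because the list of directories is built from a set, its order (and therefore the integer label later assigned to each class) may differ between runs or machines. A minimal alternative sketch, assuming the same one-folder-per-class layout, sorts the directories so the labels are reproducible. The helper name `list_class_directories` and the throwaway demonstration folders are hypothetical.

```python
import os
import tempfile

def list_class_directories(train_root):
    """Return the class sub-directories of train_root in sorted order."""
    return sorted(
        os.path.join(train_root, name)
        for name in os.listdir(train_root)
        if os.path.isdir(os.path.join(train_root, name))
    )

# Demonstration on a temporary directory tree (folder names are placeholders)
demo_root = tempfile.mkdtemp()
for name in ["Podonidae_Podon", "Annelida_larvae", "other_egg"]:
    os.mkdir(os.path.join(demo_root, name))
# A stray file that must not be treated as a class
open(os.path.join(demo_root, "readme.txt"), "w").close()

demo_dirs = list_class_directories(demo_root)
print([os.path.basename(d) for d in demo_dirs])
```

In the tutorial this would be called as `list_class_directories(os.path.join("competition_data", "train"))`.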

### Example Image

We will develop our feature on one image example and examine each step before calculating the feature across the distribution of classes.

In [None]:
# Example image
# This example was chosen because it has two noncontiguous pieces
# that will make the segmentation example more illustrative
example_file = glob.glob(os.path.join(directory_names[5],"*.jpg"))[9]
print(example_file)
im = imread(example_file, as_grey=True)
plt.imshow(im, cmap=cm.gray)
plt.show()

### Preparing the Images

To create the features of interest, we will need to prepare the images by completing a few preprocessing procedures. We will step through some common image preprocessing actions: thresholding the images, segmenting the images, and extracting region properties. Using the region properties, we will create features based on the intrinsic properties of the classes, which we expect will allow us to discriminate between them. Let's walk through the process of adding one such feature: the ratio of the width to the length of the object of interest.

First, we threshold the image on its mean value, which removes some of the noise. Then, we apply a three-step segmentation process: first we dilate the image to connect neighboring pixels, then we calculate the labels for connected regions, and finally we apply the original threshold to the labels so that they mark the original, undilated regions.

In [None]:
# First we threshold the image on its mean: pixels darker than the mean become
# foreground (1.0) and brighter pixels become background (0.0). This reduces
# noise in the image, and the result is used later as a mask.
f = plt.figure(figsize=(12,3))
imthr = np.where(im > np.mean(im),0.,1.0)
sub1 = plt.subplot(1,4,1)
plt.imshow(im, cmap=cm.gray)
sub1.set_title("Original Image")

sub2 = plt.subplot(1,4,2)
plt.imshow(imthr, cmap=cm.gray_r)
sub2.set_title("Thresholded Image")

imdilated = morphology.dilation(imthr, np.ones((4,4)))
sub3 = plt.subplot(1, 4, 3)
plt.imshow(imdilated, cmap=cm.gray_r)
sub3.set_title("Dilated Image")

labels = measure.label(imdilated)
labels = imthr*labels
labels = labels.astype(int)
sub4 = plt.subplot(1, 4, 4)
sub4.set_title("Labeled Image")
plt.imshow(labels)
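
To see why the dilation step matters, the following sketch runs the same three steps on a tiny synthetic mask with two pieces separated by a small gap. It is an illustration only and swaps in `scipy.ndimage` (`binary_dilation` and `label`) for the scikit-image calls used in the cell above; the array sizes and variable names are arbitrary choices.

```python
import numpy as np
from scipy import ndimage

# Synthetic binary mask: two 5x3 blobs separated by a 2-pixel gap
mask = np.zeros((12, 14), dtype=bool)
mask[3:8, 2:5] = True
mask[3:8, 7:10] = True

# Labeling the raw mask treats the two blobs as separate objects
_, n_raw = ndimage.label(mask)

# Dilating with a 4x4 structuring element bridges the gap first
dilated = ndimage.binary_dilation(mask, structure=np.ones((4, 4)))
dilated_labels, n_dilated = ndimage.label(dilated)

# Multiplying by the original mask restores the undilated shapes,
# but both pieces now carry the same label
labels_demo = dilated_labels * mask

print(n_raw, n_dilated)
```

The raw mask yields two components while the dilated mask yields one, so nearby fragments of the same organism end up in a single region.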

With the image segmented into different parts, we select the largest non-background segment as the likely object of interest for classification purposes and compute our metric on it. We loop through the available regions and select the one with the largest area. Many other region properties are available that you can explore for creating new features; look at the documentation for regionprops for inspiration.

In [None]:
# calculate common region properties for each region within the segmentation
regions = measure.regionprops(labels)
# find the largest nonzero region
def getLargestRegion(props=regions, labelmap=labels, imagethres=imthr):
    regionmaxprop = None
    for regionprop in props:
        # check to see if the region is at least 50% nonzero
        if sum(imagethres[labelmap == regionprop.label])*1.0/regionprop.area < 0.50:
            continue
        if regionmaxprop is None:
            regionmaxprop = regionprop
        if regionmaxprop.filled_area < regionprop.filled_area:
            regionmaxprop = regionprop
    return regionmaxprop

The results for our test image are shown below. The segmentation picked one region and we use that region to calculate our ratio metric.

In [None]:
regionmax = getLargestRegion()
plt.imshow(np.where(labels == regionmax.label,1.0,0.0))
plt.show()

In [None]:
print(regionmax.minor_axis_length/regionmax.major_axis_length)
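
For intuition about what this ratio measures: the axis lengths reported by regionprops describe an ellipse with the same second central moments as the region, so the ratio can be approximated from the covariance of the pixel coordinates. The sketch below is a NumPy-only approximation under that assumption, not scikit-image's own implementation, and it may differ slightly from the regionprops values. The function name `axis_ratio_from_mask` is hypothetical.

```python
import numpy as np

def axis_ratio_from_mask(mask):
    """Approximate minor/major axis ratio from the pixel-coordinate covariance."""
    rows, cols = np.nonzero(mask)
    if rows.size < 2:
        return 0.0
    coords = np.stack([rows, cols]).astype(float)
    # 2x2 covariance matrix of the (row, col) coordinates of foreground pixels
    cov = np.cov(coords, bias=True)
    # eigvalsh returns eigenvalues in ascending order: [minor, major]
    eigvals = np.linalg.eigvalsh(cov)
    if eigvals[1] <= 0:
        return 0.0
    # Axis lengths scale with the square root of the eigenvalues
    return float(np.sqrt(max(eigvals[0], 0.0) / eigvals[1]))

# A 10 x 40 pixel rectangle should give a ratio close to 10/40
mask = np.zeros((60, 60), dtype=bool)
mask[25:35, 10:50] = True
ratio = axis_ratio_from_mask(mask)
print(round(ratio, 3))
```

Elongated objects give values near 0 and compact, round objects give values near 1.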

Now, we collect the previous steps together in a function to make it easily repeatable.

In [None]:
def getMinorMajorRatio(image):
    image = image.copy()
    # Create the thresholded image to eliminate some of the background
    imagethr = np.where(image > np.mean(image),0.,1.0)

    #Dilate the image
    imdilated = morphology.dilation(imagethr, np.ones((4,4)))

    # Create the label list
    label_list = measure.label(imdilated)
    label_list = imagethr*label_list
    label_list = label_list.astype(int)
    
    region_list = measure.regionprops(label_list)
    maxregion = getLargestRegion(region_list, label_list, imagethr)
    
    # guard against cases where the segmentation fails by returning zero
    ratio = 0.0
    if maxregion is not None and maxregion.major_axis_length != 0.0:
        ratio = maxregion.minor_axis_length*1.0 / maxregion.major_axis_length
    return ratio

### Preparing Training Data

With our code for the ratio of minor to major axis, let's add the raw pixel values to the list of features for our dataset. In order to use the pixel values in a model for our classifier, we need a fixed length feature vector, so we will rescale the images to be constant size and add the fixed number of pixels to the feature vector.

To create the feature vectors, we will loop through each of the directories in our training data set and then loop over each image within that class. For each image, we will rescale it to 25 x 25 pixels and then add the rescaled pixel values to a feature vector, X. The last feature we include will be our width-to-length ratio. We will also create the class label in the vector y, which will have the true class label for each row of the feature vector, X.

In [None]:
# Rescale the images and create the combined metrics and training labels

#get the total training images
numberofImages = 0
for folder in directory_names:
    for fileNameDir in os.walk(folder):   
        for fileName in fileNameDir[2]:
            # Only read in the images
            if fileName[-4:] != ".jpg":
                continue
            numberofImages += 1

# We'll rescale the images to be 25x25
maxPixel = 25
imageSize = maxPixel * maxPixel
num_rows = numberofImages # one row for each image in the training dataset
num_features = imageSize + 1 # for our ratio

# X is the feature vector with one row of features per image
# consisting of the pixel values and our metric
X = np.zeros((num_rows, num_features), dtype=float)
# y is the numeric class label 
y = np.zeros((num_rows))

files = []
# Generate training data
i = 0    
label = 0
# List of string of class names
namesClasses = list()

print("Reading images")
# Navigate through the list of directories
for folder in directory_names:
    # Append the string class name for each class
    # (the last path component; os.pathsep is the PATH-list separator, not the directory separator)
    currentClass = os.path.basename(folder)
    namesClasses.append(currentClass)
    for fileNameDir in os.walk(folder):   
        for fileName in fileNameDir[2]:
            # Only read in the images
            if fileName[-4:] != ".jpg":
                continue
            
            # Read in the images and create the features
            nameFileImage = "{0}{1}{2}".format(fileNameDir[0], os.sep, fileName)            
            image = imread(nameFileImage, as_grey=True)
            files.append(nameFileImage)
            axisratio = getMinorMajorRatio(image)
            image = resize(image, (maxPixel, maxPixel))
            
            # Store the rescaled image pixels and the axis ratio
            X[i, 0:imageSize] = np.reshape(image, (1, imageSize))
            X[i, imageSize] = axisratio
            
            # Store the classlabel
            y[i] = label
            i += 1
            # report progress for each 5% done
            report = [int((j+1)*num_rows/20.) for j in range(20)]
            if i in report:
                print("{0:.0f}% done".format(np.ceil(i * 100.0 / num_rows)))
    label += 1

### Width-to-Length Ratio Class Separation

Now that we have calculated the width-to-length ratio metric for all the images, we can look at the class separation to see how well our feature performs. We'll compare pairs of the classes' distributions by plotting each pair of classes. While this will not cover the whole space of hundreds of possible combinations, it will give us a feel for how similar or dissimilar different classes are in this feature, and the class distributions should be comparable across subplots.

In [None]:
# Loop through the classes two at a time and compare their distributions of the Width/Length Ratio

# Create a DataFrame object to make subsetting the data on the class easier
df = pd.DataFrame({"class": y[:], "ratio": X[:, num_features-1]})

f = plt.figure(figsize=(30, 20))
#we suppress zeros and choose a few large classes to better highlight the distributions.
df = df.loc[df["ratio"] > 0]
minimumSize = 20 
counts = df["class"].value_counts()
largeclasses = [int(x) for x in list(counts.loc[counts > minimumSize].index)]
# Loop through 40 of the classes, two per subplot
for j in range(0,40,2):
    # integer division keeps the subplot index an integer
    subfig = plt.subplot(4, 5, j//2 + 1)
    # Plot the normalized histograms for two classes
    classind1 = largeclasses[j]
    classind2 = largeclasses[j+1]
    n, bins,p = plt.hist(df.loc[df["class"] == classind1]["ratio"].values,\
                         alpha=0.5, bins=[x*0.01 for x in range(100)], \
                         label=namesClasses[classind1].split(os.sep)[-1], normed=1)

    n2, bins,p = plt.hist(df.loc[df["class"] == (classind2)]["ratio"].values,\
                          alpha=0.5, bins=bins, label=namesClasses[classind2].split(os.sep)[-1],normed=1)
    subfig.set_ylim([0.,10.])
    plt.legend(loc='upper right')
    plt.xlabel("Width/Length Ratio")
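
If you would like a number to go with the visual comparison, one simple option is the overlap between two normalized histograms: 0 means the classes never share a bin and 1 means the distributions are identical. This is a minimal sketch of that idea, not part of the original tutorial; the function name `histogram_overlap` and the sample values are made up for illustration.

```python
import numpy as np

def histogram_overlap(values_a, values_b, bins):
    """Shared fraction of two histograms normalized to sum to 1."""
    counts_a, _ = np.histogram(values_a, bins=bins)
    counts_b, _ = np.histogram(values_b, bins=bins)
    if counts_a.sum() == 0 or counts_b.sum() == 0:
        return 0.0
    frac_a = counts_a / float(counts_a.sum())
    frac_b = counts_b / float(counts_b.sum())
    # In each bin, the two classes share the smaller of the two fractions
    return float(np.minimum(frac_a, frac_b).sum())

# Same binning as the plots: 100 bins of width 0.01 on [0, 1]
bins = np.linspace(0.0, 1.0, 101)
# Made-up ratios for an elongated class and a rounder class
elongated = np.array([0.11, 0.12, 0.15, 0.18])
round_ish = np.array([0.81, 0.85, 0.88, 0.92])
print(histogram_overlap(elongated, round_ish, bins))
print(histogram_overlap(elongated, elongated, bins))
```

With the DataFrame built above, the inputs would be the `ratio` values of two classes, for example `df.loc[df["class"] == classind1]["ratio"].values`.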

From the (truncated) figure above, you will see some cases where the classes are well separated and others where they are not. It is typical that one single feature will not allow you to completely separate more than thirty distinct classes. You will need to be creative in coming up with additional metrics to discriminate between all the classes.

### Random Forest Classification

We choose a random forest model to classify the images. Random forests perform well in many classification tasks and have robust default settings. We will give a brief description of a random forest model so that you can understand its two main free parameters: n_estimators and max_features.

A random forest is an ensemble of n_estimators decision trees. During training, each decision tree is grown automatically by making a series of conditional splits on the data. At each split, a random sample of max_features features is drawn, and a condition on one of them decides which of the two child nodes each data point is sent to. The best condition is the one that maximizes the class purity of the child nodes. The tree keeps growing by making additional splits until the leaves are pure or contain fewer than the minimum number of samples required for a split (the sklearn default for min_samples_split is two data points). To classify a new data point, each tree uses the class composition of the terminal node that the point lands in, and the aggregate vote across the forest determines the predicted class.
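
To make "class purity" concrete, the sketch below scores candidate splits with Gini impurity, which is the default split criterion in scikit-learn's tree-based classifiers. It is a simplified illustration of how one split could be evaluated, not the library's implementation; the function names and the toy feature values are hypothetical.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    fractions = counts / float(labels.size)
    return float(1.0 - np.sum(fractions ** 2))

def weighted_split_impurity(feature, labels, threshold):
    """Size-weighted impurity of the two child nodes for feature <= threshold."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = float(labels.size)
    return (left.size / n) * gini_impurity(left) + (right.size / n) * gini_impurity(right)

# Toy data: one made-up feature (for example an axis ratio) and two classes
feature = np.array([0.10, 0.15, 0.20, 0.60, 0.70, 0.80])
labels = np.array([0, 0, 0, 1, 1, 1])

parent = gini_impurity(labels)
clean = weighted_split_impurity(feature, labels, 0.40)
poor = weighted_split_impurity(feature, labels, 0.12)
print(parent, clean, poor)
```

A tree would prefer the threshold at 0.40 here because it leaves both child nodes pure, while the threshold at 0.12 leaves one child almost as mixed as the parent.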

With our training data consisting of the feature vector X and the class label vector y, we will now calculate some metrics for the performance of our model, by class and overall. First, we pass the random forest and all the available data to cross_val_score, which performs the 5-fold cross validation and returns the accuracy for each fold. Then we repeat the cross validation with the KFold iterator (imported above as an alias for StratifiedKFold), which splits the data into train and test sets, and use the predictions on the test folds to build a classification report. The classification report provides a useful list of per-class performance metrics for your classifier, in contrast to the internal metrics of the random forest module.

In [None]:
print("Training")
# n_estimators is the number of decision trees
# max_features also known as m_try is set to the default value of the square root of the number of features
clf = RF(n_estimators=100, n_jobs=3)
scores = cross_validation.cross_val_score(clf, X, y, cv=5, n_jobs=1)
print("Accuracy of all classes")
print(np.mean(scores))

In [None]:
kf = KFold(y, n_folds=5)
y_pred = y * 0
for train, test in kf:
    X_train, X_test, y_train, y_test = X[train,:], X[test,:], y[train], y[test]
    clf = RF(n_estimators=100, n_jobs=3)
    clf.fit(X_train, y_train)
    y_pred[test] = clf.predict(X_test)
print(classification_report(y, y_pred, target_names=namesClasses))
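
The cells above rely on the `sklearn.cross_validation` module, which was removed in scikit-learn 0.20. If you are working with a newer release, the same workflow can be sketched with `sklearn.model_selection`, as below. This is an illustrative equivalent on synthetic stand-in data, not the original tutorial code; `X_demo`, `y_demo`, and the parameter values are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data: 3 classes, 30 samples each, 5 features
rng = np.random.RandomState(0)
y_demo = np.repeat(np.arange(3), 30)
X_demo = rng.normal(size=(90, 5)) + y_demo[:, np.newaxis]

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# Each sample is predicted by a model that did not see it during training
y_pred_demo = cross_val_predict(clf_demo, X_demo, y_demo, cv=cv)
print(classification_report(y_demo, y_pred_demo))
```

Passing `method="predict_proba"` to `cross_val_predict` returns out-of-fold class probabilities instead of hard labels.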

The current model, while somewhat accurate overall, performs poorly on some classes, including the shrimp caridean, stomatopod, and hydromedusae tentacles classes. For others it does quite well, getting many correct classifications for the trichodesmium_puff and copepod_oithona_eggs classes. The metrics shown above for measuring model performance include precision, recall, and f1-score. The precision metric gives the probability that a predicted class is correct (true positives / (true positives + false positives)), while recall measures the ability of the model to correctly classify examples of a given class (true positives / (true positives + false negatives)). The F1 score is the harmonic mean of precision and recall.
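
As a worked example of these definitions, the sketch below computes the three metrics for a single class from made-up counts. The counts and the helper name `precision_recall_f1` are hypothetical; the formulas are the standard ones, with F1 being the harmonic mean of precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from raw counts."""
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 8 correct detections, 2 false alarms, 8 missed examples
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)

# For comparison only: the geometric and arithmetic means of the same values
geometric = (precision * recall) ** 0.5
arithmetic = (precision + recall) / 2.0
print(precision, recall, f1, geometric, arithmetic)
```

The harmonic mean is never larger than the other two means, so a class with high precision but low recall (or the reverse) still receives a low F1 score.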

The competition scoring uses a multiclass log-loss metric to compute your overall score. In the next steps, we define the multiclass log-loss function and compute your estimated score on the training dataset.

In [None]:
def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    https://www.kaggle.com/wiki/MultiClassLogLoss

    Parameters
    ----------
    y_true : array, shape = [n_samples]
            true class, integers in [0, n_classes - 1]
    y_pred : array, shape = [n_samples, n_classes]

    Returns
    -------
    loss : float
    """
    predictions = np.clip(y_pred, eps, 1 - eps)

    # normalize row sums to 1
    predictions /= predictions.sum(axis=1)[:, np.newaxis]

    actual = np.zeros(y_pred.shape)
    n_samples = actual.shape[0]
    actual[np.arange(n_samples), y_true.astype(int)] = 1
    vectsum = np.sum(actual * np.log(predictions))
    loss = -1.0 / n_samples * vectsum
    return loss

In [None]:
# Get the probability predictions for computing the log-loss function
kf = KFold(y, n_folds=5)
# prediction probabilities number of samples, by number of classes
y_pred = np.zeros((len(y),len(set(y))))
for train, test in kf:
    X_train, X_test, y_train, y_test = X[train,:], X[test,:], y[train], y[test]
    clf = RF(n_estimators=100, n_jobs=3)
    clf.fit(X_train, y_train)
    y_pred[test] = clf.predict_proba(X_test)

In [None]:
multiclass_log_loss(y, y_pred)

The multiclass log loss function is a classification error metric that heavily penalizes you for being both confident (either predicting very high or very low class probability) and wrong. Throughout the competition you will want to check that your model improvements are driving this loss metric lower.
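
To get a feel for the size of this penalty, the sketch below evaluates the per-sample loss, which is the negative log of the probability assigned to the true class, for a few made-up probabilities, along with the loss of a model that always predicts a uniform distribution. The helper names and the class count are illustrative assumptions; the clipping value mirrors the `eps` default used above.

```python
import numpy as np

def per_sample_log_loss(p_true_class, eps=1e-15):
    """Negative log of the (clipped) probability given to the true class."""
    p = np.clip(p_true_class, eps, 1 - eps)
    return -np.log(p)

def uniform_baseline_loss(n_classes):
    """Log loss of always predicting 1/n_classes for every class."""
    return float(np.log(n_classes))

# Probability assigned to the TRUE class: confident and right, unsure,
# confident and wrong, and certain and wrong
probs = np.array([0.9, 0.5, 0.1, 0.0])
losses = per_sample_log_loss(probs)
print(losses)

# Illustrative class count; substitute the number of classes in your training set
n_classes = 94
print(uniform_baseline_loss(n_classes))
```

A single prediction that assigns zero probability to the true class costs roughly 34.5 after clipping, far more than many mildly uncertain predictions combined, which is why smoothing extreme probabilities often helps this metric.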

### Where to Go From Here
Now that you've made a simple metric, created a model, and examined the model's performance on the training data, the next step is to improve your model to make it more competitive. The random forest model we created does not perform evenly across all classes and in some cases fails completely. By creating new features and looking directly at the distributions for the problem classes, you can identify features that specifically help separate those classes from the others. You can also experiment with other image properties, stratified sampling, image transformations, or other classification models.