Minor additions and bug fixes (#13)

* Water raman scans processing and viz
* Debugging the S3 demo data download
* attempting to migrate from circleCI to github actions
* Playing with github actions. Publish to pypi on release.
* integrating pre-commit and black
* getting the GH action linter working
* GH action for docs
* Debugging GH action for docs
* increment minor version for new release
* added some tests for new plotting functions.
* debugging codecov GH action.
* Update README
* JOSS paper prep.
* Added the MIT REMORA instrument and fixed minor bugs.
drewmee · Jun 11, 2021 · commit d9407b5 · 1 parent 0f41b69

Showing 16 changed files with 254 additions and 53 deletions.
diff --git a/.github/workflows/codecov.yml b/.github/workflows/codecov.yml
@@ -0,0 +1,31 @@
+name: Codecov
+on: [push]
+jobs:
+  run:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest]
+    env:
+      OS: ${{ matrix.os }}
+      PYTHON: '3.7'
+    steps:
+    - uses: actions/checkout@master
+    - name: Setup Python
+      uses: actions/setup-python@master
+      with:
+        python-version: 3.7
+    - name: Generate coverage report
+      run: |
+        python -m pip install --upgrade pip
+        pip install -e .[tests]
+        pip install pytest-cov
+        pytest --cov=./ --cov-report=xml
+    - name: Upload coverage to Codecov
+      uses: codecov/codecov-action@v1.0.5
+      with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          file: ./coverage.xml
+          flags: unittests
+          name: codecov-umbrella
+          fail_ci_if_error: true
diff --git a/README.md b/README.md
@@ -1,13 +1,13 @@
 # PyEEM
 
-![Test](https://github.com/drewmee/PyEEM/workflows/Test/badge.svg)
-[![Read the Docs](https://readthedocs.org/projects/pyeem/badge/?version=latest)](https://pyeem.readthedocs.io/)
 [![PyPi version](https://img.shields.io/pypi/v/pyeem.svg 'pypi version')](https://pypi.org/project/pyeem/)
 [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyeem.svg)](https://pypi.org/project/pyeem/)
+[![Test](https://github.com/drewmee/PyEEM/workflows/Test/badge.svg)](https://github.com/drewmee/PyEEM/actions?query=workflow%3ATest)
+[![Read the Docs](https://readthedocs.org/projects/pyeem/badge/?version=latest)](https://pyeem.readthedocs.io/)
+[![codecov](https://codecov.io/gh/drewmee/PyEEM/branch/master/graph/badge.svg?token=RAPG3XDZ6H)](https://codecov.io/gh/drewmee/PyEEM)
+[![Code style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/drewmee/PyEEM/master?filepath=docs%2Fsource%2Ftutorials%2Fnotebooks)
 [![License](https://img.shields.io/github/license/mashape/apistatus.svg)](https://github.com/drewmee/PyEEM/blob/master/LICENSE)
-[![Code style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
-<!--- Badge for codecov -->
 
 Python library for the preprocessing, analysis, and visualization of Excitation Emission Matrices (EEMs).
 

diff --git a/docs/source/LICENSE_opcsim b/docs/source/LICENSE_opcsim
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2016-2020 David H Hagan and Jesse H Kroll
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/docs/source/LICENSE b/docs/source/LICENSE_pyunfold
rename from docs/source/LICENSE
rename to docs/source/LICENSE_pyunfold
diff --git a/docs/source/tutorials/notebooks/tutorial_2.ipynb b/docs/source/tutorials/notebooks/tutorial_2.ipynb
diff --git a/paper/paper.md b/paper/paper.md
@@ -1,5 +1,5 @@
 ---
-title: 'PyEEM: A Python library for the preprocessing, correction, deconvolution and analysis of Excitation Emission Matrices (EEMs).'
+title: 'PyEEM: A Python library for the preprocessing, correction, and analysis of Excitation Emission Matrices (EEMs).'
 tags:
  - python
  - fluorescence
@@ -9,10 +9,12 @@ tags:
 authors:
  - name: Drew Meyers
    affiliation: "1, 2"
+ - name: Jay W Rutherford
+   affiliation: 3
  - name: Qinmin Zheng
    affiliation: 2
  - name: Fabio Duarte
-   affiliation: "2, 3"
+   affiliation: "2, 4"
  - name: Carlo Ratti
    affiliation: 2  
  - name: Harold H Hemond
@@ -24,42 +26,24 @@ affiliations:
    index: 1
  - name: Senseable City Lab, Massachusetts Institute of Technology
    index: 2
- - name: Pontifícia Universidade Católica do Paraná, Brazil
+ - name: Department of Chemical Engineering, University of Washington
    index: 3
+ - name: Pontifícia Universidade Católica do Paraná, Brazil
+   index: 4
 date: 2020-07-08
 bibliography: paper.bib
 ---
 
-# Statement of Need
-
-Fluorescence Excitation and Emission Matrix Spectroscopy (EEMs) is a popular analytical technique in environmental monitoring. In particular, it has been applied extensively to investigate the composition and concentration of dissolved organic matter (DOM) in aquatic systems [@Coble1990;@McKnight2001;@Fellman2010]. Historically, EEMs have been combined with multi-way techniques such as PCA, ICA, and PARAFAC in order to decompose chemical mixtures [@Bro1997;@Stedmon2008;@Murphy2013;@CostaPereira2018]. More recently, machine learning approaches such as convolutional neural networks (CNNs) and autoencoders have been applied to EEMs for source sepearation of chemical mixtures [@Cuss2016;@Peleato2018;@Ju2019;@Rutherford2020]. However, before these source separation techniques can be performed, several preprocessing and correction steps must be applied to the raw EEMs. In order to achieve comparability between studies, standard methods to apply these corrections have been developed [@Ohno2002;@Bahram2006;@Lawaetz2009;@R.Murphy2010;@Murphy2011;@Kothawala2013]. These standard methods have been implemented in Matlab and R packages [@Murphy2013;@Massicotte;Pucher2019]. However until PyEEM, no Python package existed which implemented these standard correction steps. Furthermore, the Matlab and R implementations impose metadata schemas on users which limit their ability to track several important metrics corresponding with each measurement set. By providing a Python implementation, researchers will now be able to more effectively leverage Python's large scienfitic computing ecosystem when working with EEMs.
-
-In addition to the implementation of the preprocessing and correction steps, PyEEM also provides researchers with the ability to create augmented mixture and single source training data from a small set of calibration EEM measurements. The augmentation technique relies on the fact that fluorescnce spectra are linearly additive in mixtures, according to Beer's law [source]. This augmentation technique was first described in Rutherford et al., in which it was used to train a CNN to predict the concentration of single sources of pollutants in spectral mixtures [@Rutherford2020]. Additionally, augmented and synthetic data has shown promise in improving the performace of deep learning models in several fields [@Nikolenko2019]. 
-
-PyEEM provides the first open source implementation of such an augmentation technique for EEMs. PyEEM also provides plots toolbox useful in the interpretation of EEMs... [@Hansen2018]
-
 # Summary
 
-- A summary describing the high-level functionality and purpose of the software for a diverse, non-specialist audience...
-- Description of how the software enables some new research challenges to be addressed or makes addressing research challenges significantly better (e.g., faster, easier, simpler)...
-- Description of how the software is feature-complete (i.e. no half-baked solutions) and designed for maintainable extension (not one-off modifications of existing tools)...
+Fluorescence Excitation and Emission Matrix Spectroscopy (EEMs) is a popular analytical technique in environmental monitoring. In particular, it has been applied extensively to investigate the composition and concentration of dissolved organic matter (DOM) in aquatic systems [@Coble1990;@McKnight2001;@Fellman2010]. Historically, EEMs have been combined with multi-way techniques such as PCA, ICA, and PARAFAC in order to decompose chemical mixtures [@Bro1997;@Stedmon2008;@Murphy2013;@CostaPereira2018]. More recently, deep learning approaches such as convolutional neural networks (CNNs) and autoencoders have been applied to EEMs for source separation of chemical mixtures [@Cuss2016;@Peleato2018;@Ju2019;@Rutherford2020]. However, before these source separation techniques can be performed, several preprocessing and correction steps must be applied to the raw EEMs. In order to achieve comparability between studies, standard methods to apply these corrections have been developed [@Ohno2002;@Bahram2006;@Lawaetz2009;@R.Murphy2010;@Murphy2011;@Kothawala2013]. PyEEM provides a Python implementation of these standard preprocessing and correction steps for EEM measurements produced by several common spectrofluorometers.
 
-PyEEM is a python library for the preprocessing, correction, deconvolution and analysis of Excitation Emission Matrices (EEMs)...
+In addition to implementing the standard preprocessing and correction steps, PyEEM also provides researchers with the ability to create augmented single source and mixture training data from a small set of calibration EEM measurements. The augmentation technique relies on the fact that fluorescence spectra are linearly additive in mixtures, according to Beer's law. This augmentation technique was first described in Rutherford et al., in which it was used to train a CNN to predict the concentration of single sources of pollutants in spectral mixtures [@Rutherford2020]. Additionally, augmented and synthetic data have shown promise in improving the performance of deep learning models in several fields [@Nikolenko2019].
 
-- Supported instruments, example datasets
-- Metadata schema [@Hansen2018]
-- Preprocessing, corrections, and filtering: 
-  - Cropping and wavelength filtering [SOURCE]
-  - Blank subtraction [SOURCE]
-  - Scattering removal [@Bahram2006]
-    - Include Zepp 2004.
-  - Inner-filter effect correction [@Ohno2002;@Kothawala2013]
-  - Raman normalization [@Lawaetz2009;@Murphy2011]
-- Augmentation [@Rutherford2020]
-- plots [@Hansen2018]
+Finally, PyEEM provides an extensive visualization toolbox, based on Matplotlib, which is useful in the interpretation of EEM datasets. This visualization toolbox includes various ways of plotting EEMs, the visualization of the Raman scatter peak area over time, and more.
 
-# Acknowledgements
+# Statement of Need
 
-We acknowledge contributions from...
+Prior to PyEEM, no open source Python package existed to work with EEMs. However, such libraries have existed for MATLAB and R for some time [@Murphy2013;@Massicotte;@Pucher2019]. By providing a Python implementation, PyEEM lets researchers more effectively leverage Python's large scientific computing ecosystem when working with EEMs. Furthermore, the existing libraries in MATLAB and R do not provide deep learning techniques for decomposing chemical mixtures from EEMs. Instead, these libraries provide PARAFAC methods for performing such a task. Although PARAFAC has been widely used for some time, it has its limitations, and recent work has shown promise in using deep learning approaches. For this reason, PyEEM provides a toolbox for generating augmented training data as well as an implementation of the CNN architecture reported in Rutherford et al., which has been shown to successfully decompose spectral mixtures [@Rutherford2020].
 
 # References
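
The revised Summary states that the augmentation relies on fluorescence spectra being linearly additive in mixtures. A minimal NumPy sketch of that idea follows. It is not PyEEM's implementation: the helper `make_mixture`, the toy 2x2 spectra, and the assumption that each prototypical spectrum scales linearly with concentration relative to the concentration it was measured at are all hypothetical.

```python
import numpy as np


def make_mixture(proto_eems, proto_concs, target_concs):
    """Build a synthetic mixture EEM as a concentration-weighted sum.

    Hypothetical helper: assumes each source's spectrum scales linearly
    with concentration and that spectra add linearly in a mixture.
    """
    mixture = np.zeros_like(proto_eems[0], dtype=float)
    for eem, proto_c, target_c in zip(proto_eems, proto_concs, target_concs):
        # Scale the prototypical spectrum to the target concentration.
        mixture += eem * (target_c / proto_c)
    return mixture


# Toy prototypical spectra for two sources (values are made up).
source_a = np.array([[1.0, 2.0], [3.0, 4.0]])  # measured at concentration 1.0
source_b = np.array([[0.5, 0.5], [1.0, 1.0]])  # measured at concentration 2.0

# Mixture with source A at concentration 2.0 and source B at 4.0.
mixture = make_mixture([source_a, source_b], [1.0, 2.0], [2.0, 4.0])
```

Here both sources are doubled relative to their prototypical concentrations, so the mixture is simply `2 * source_a + 2 * source_b`.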
diff --git a/pyeem/analysis/models/rutherfordnet.py b/pyeem/analysis/models/rutherfordnet.py
@@ -16,6 +16,8 @@
 )
 from tensorflow.keras.models import Sequential
 
+# from tensorflow.keras.optimizers import Adam
+
 
 class RutherfordNet:
     """The convolutional neural network (CNN) described in Rutherford et al. 2020."""
@@ -86,6 +88,12 @@ def create_model(
         default_compile_kws = dict(
             loss="mean_squared_error", optimizer="adam", metrics=["accuracy"]
         )
+        """
+        opt = Adam(learning_rate=0.0001)
+        default_compile_kws = dict(
+            loss="mean_squared_error", optimizer=opt, metrics=["accuracy"]
+        )
+        """
         compile_kws = dict(default_compile_kws, **compile_kws)
         model.compile(**compile_kws)
         return model
@@ -229,7 +237,9 @@ def get_test_data(self, dataset, routine_results_df):
         """
         test_samples_df = self._isolate_test_samples(dataset, routine_results_df)
 
-        sources = test_samples_df.index.get_level_values("source").unique().values
+        sources = (
+            test_samples_df.index.get_level_values("source").unique().dropna().values
+        )
         sources = np.delete(sources, np.where(sources == "mixture"))
 
         X = []
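
The `.dropna()` added in `get_test_data` presumably guards against samples whose `source` index level is missing. The sketch below reproduces that situation on a toy frame; the index values and the `sample` level are made up, and a boolean mask stands in for the `np.delete`/`np.where` pair used in the diff.

```python
import numpy as np
import pandas as pd

# Toy index: one sample has no source label (NaN). The "source" level
# name mirrors the diff; everything else here is hypothetical.
index = pd.MultiIndex.from_tuples(
    [("cigarette", 1), ("diesel", 2), ("mixture", 3), (np.nan, 4)],
    names=["source", "sample"],
)
df = pd.DataFrame({"value": [0.1, 0.2, 0.3, 0.4]}, index=index)

# Without dropna(), the NaN label would be treated as a source name.
raw_sources = df.index.get_level_values("source").unique()

# With dropna(), only labeled sources remain; "mixture" is then excluded.
sources = raw_sources.dropna().values
sources = sources[sources != "mixture"]
```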

diff --git a/pyeem/augmentation/base.py b/pyeem/augmentation/base.py
@@ -40,7 +40,7 @@ def prototypical_spectrum(dataset, source_df):
     )
 
     proto_eems = []
-    for index, row in source_df.iterrows():
+    for index, row in source_df[source_df["prototypical_sample"]].iterrows():
         eem_path = row["hdf_path"]
         eem = pd.read_hdf(dataset.hdf, key=eem_path)
         proto_eems.append(eem)
@@ -51,11 +51,13 @@ def prototypical_spectrum(dataset, source_df):
         "concentration"
     ].mean()
 
+    """
     weights = []
     for i in range(len(proto_eems)):
         weights.append(random.uniform(0, 1))
-
     proto_eem = np.average([eem.values for eem in proto_eems], axis=0, weights=weights)
+    """
+    proto_eem = np.average([eem.values for eem in proto_eems], axis=0)
 
     proto_eem = pd.DataFrame(
         data=proto_eem, index=proto_eems[0].index, columns=proto_eems[0].columns

diff --git a/pyeem/instruments/MIT/__init__.py b/pyeem/instruments/MIT/__init__.py
@@ -0,0 +1,4 @@
+from .remora import Remora
+
+name = "MIT"
+instruments = [Remora]
diff --git a/pyeem/instruments/MIT/remora.py b/pyeem/instruments/MIT/remora.py
@@ -0,0 +1,83 @@
+import pandas as pd
+
+
+class Remora:
+    """The MIT REMORA, a compact, field-deployable spectrofluorometer."""
+
+    manufacturer = "MIT"
+    """Name of Manufacturer."""
+
+    name = "REMORA"
+    """Name of Instrument."""
+
+    supported_models = ["REMORA-V1"]
+    """List of supported models."""
+
+    def __init__(self, model, sn=None):
+        """
+        Args:
+            model (str): The model name of the instrument.
+            sn (str or int, optional): The serial number of the instrument.
+                Defaults to None.
+        """
+        self.model = model
+        self.sn = sn
+
+    @staticmethod
+    def load_eem(filepath):
+        """Loads an Excitation Emission Matrix which is generated by the instrument.
+
+        Args:
+            filepath (str): The filepath of the data file.
+
+        Returns:
+            pandas.DataFrame: An Excitation Emission Matrix.
+        """
+        eem_df = pd.read_csv(filepath, index_col=0)
+        eem_df.columns = eem_df.columns.astype(float)
+        eem_df = eem_df.sort_index(axis=0)
+        eem_df = eem_df.sort_index(axis=1)
+        eem_df.index.name = "emission_wavelength"
+        return eem_df
+
+    @staticmethod
+    def load_absorbance(filepath):
+        """Loads an absorbance spectrum which is generated by the instrument.
+
+        Args:
+            filepath (str): The filepath of the data file.
+
+        Returns:
+            pandas.DataFrame: An absorbance spectrum.
+        """
+        absorb_df = pd.read_csv(filepath, index_col=0)
+        absorb_df.index.name = "excitation_wavelength"
+        absorb_df.index = absorb_df.index.astype("float64")
+        absorb_df = absorb_df.sort_index(axis=0)
+        return absorb_df
+
+    @staticmethod
+    def load_water_raman(filepath):
+        """Loads a water Raman spectrum which is generated by the instrument.
+
+        Args:
+            filepath (str): The filepath of the data file.
+
+        Returns:
+            pandas.DataFrame: A water Raman spectrum.
+        """
+        raman_df = pd.read_csv(filepath, index_col=0)
+        raman_df.columns = raman_df.columns.astype(float)
+        raman_df = raman_df.sort_index(axis=0)
+
+        raman_df = raman_df.rename(columns={raman_df.columns[0]: "intensity"})
+        raman_df.index.name = "emission_wavelength"
+        return raman_df
+
+    @staticmethod
+    def load_spectral_corrections():
+        """TODO - Should load instrument specific spectral corrections which will
+        be used in data preprocessing.
+
+        Raises:
+            NotImplementedError: On the TODO list...
+        """
+        raise NotImplementedError()
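
To make the expected input of `load_eem` concrete, here is a small sketch that applies the same pandas steps to an in-memory CSV. The file layout (first column holding emission wavelengths, header row holding excitation wavelengths) is inferred from the code above and is an assumption, as are the toy values.

```python
import io

import pandas as pd

# Hypothetical file contents: the first column holds emission wavelengths,
# the header row holds excitation wavelengths. Values are made up, and the
# rows and columns are deliberately out of order.
csv_text = "em,300,250\n410,0.3,0.1\n400,0.4,0.2\n"

eem_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
# Header labels are read as strings, so cast them to floats before sorting.
eem_df.columns = eem_df.columns.astype(float)
# Sort both axes so wavelengths increase along rows and columns.
eem_df = eem_df.sort_index(axis=0).sort_index(axis=1)
eem_df.index.name = "emission_wavelength"
```

Casting the column labels before sorting matters: sorting the string labels would order them lexicographically rather than numerically.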
diff --git a/pyeem/instruments/__init__.py b/pyeem/instruments/__init__.py
@@ -1,6 +1,13 @@
-from . import agilent, horiba, tecan
+from . import MIT, agilent, horiba, tecan
 from .base import _get_dataset_instruments_df, get_supported_instruments
 
 supported, _supported = get_supported_instruments()
 
-__all__ = ["agilent", "horiba", "tecan", "get_supported_instruments", "supported"]
+__all__ = [
+    "agilent",
+    "horiba",
+    "tecan",
+    "MIT",
+    "get_supported_instruments",
+    "supported",
+]
diff --git a/pyeem/instruments/base.py b/pyeem/instruments/base.py
@@ -1,6 +1,6 @@
 import pandas as pd
 
-from . import agilent, horiba, tecan
+from . import MIT, agilent, horiba, tecan
 
 
 def get_supported_instruments():
@@ -17,6 +17,7 @@ def get_supported_instruments():
         agilent.name: agilent.instruments,
         horiba.name: horiba.instruments,
         tecan.name: tecan.instruments,
+        MIT.name: MIT.instruments,
     }
     # instruments = [Aqualog, Fluorolog, Cary]
     df = pd.DataFrame()

diff --git a/pyeem/plots/augmentations.py b/pyeem/plots/augmentations.py
@@ -157,7 +157,7 @@ def single_source_animation(
     max_val = ss_np.max()
 
     default_plot_kws = dict(vmin=min_val, vmax=max_val)
-    plot_kws = dict(default_fig_kws, **plot_kws)
+    plot_kws = dict(default_plot_kws, **plot_kws)
 
     default_kwargs = dict(zlim_min=min_val, zlim_max=max_val, title=None)
     kwargs = dict(default_kwargs, **kwargs)
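
The one-line fix above corrects which defaults are merged with the caller's `plot_kws`. The merge idiom itself, `dict(defaults, **overrides)`, can be sketched in isolation; `merge_kws` and the keyword values below are hypothetical and only illustrate the pattern.

```python
def merge_kws(defaults, overrides):
    """Return a new dict of defaults updated with caller overrides.

    Keys present in both take the override value; the inputs are not
    modified.
    """
    return dict(defaults, **overrides)


default_plot_kws = dict(vmin=0.0, vmax=1.0)
plot_kws = merge_kws(default_plot_kws, {"vmax": 5.0, "cmap": "viridis"})
```

Merging against the wrong defaults dictionary, as in the line being replaced, would drop the `vmin`/`vmax` defaults computed just above it.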

diff --git a/setup.py b/setup.py
@@ -32,7 +32,7 @@
         "numpy<1.19.0,>=1.18.5",
         "pandas>=1.0.5",
         "xlrd >= 1.0.0",
-        "h5py>=2.10.0",
+        "h5py<2.11.0,>=2.10.0",
         "tables>=3.6.1",
         "matplotlib>=3.3.0",
         "celluloid>=0.2.0",

diff --git a/tests/test_instruments.py b/tests/test_instruments.py
@@ -4,7 +4,7 @@
 
 
 class TestInstruments:
-    manufacturers = ["Agilent", "Horiba", "Tecan"]
+    manufacturers = ["Agilent", "Horiba", "Tecan", "MIT"]
     """
     manuf_instruments = {
         pyeem.instruments.agilent.name: pyeem.instruments.agilent.instruments,