update docs

deephdc · Feb 10, 2023 · feb75a4
1 parent d1f3994
commit feb75a4
Showing 5 changed files with 171 additions and 19 deletions.
diff --git a/source/user/howto/develop-model.rst b/source/user/howto/develop-model.rst
@@ -18,11 +18,15 @@ You only need to go to the Dashboard, select the **DEEP Development environment*
 configure the Docker image and resources you want to use
 (see `video demo <https://www.youtube.com/watch?v=J_l_xWiBGNA&list=PLJ9x9Zk1O-J_UZfNO2uWp2pFMmbwLvzXa&index=3>`__).
 
+If you are new to Machine Learning, you might want to check some
+:doc:`useful Machine Learning resources <../others/useful-ml-resources>` we compiled to help you get started.
+
 .. admonition:: Requirements
 
     * If you plan to use the **DEEP Development environment**, you need  a `DEEP-IAM <https://iam.deep-hybrid-datacloud.eu/>`__ account to be able to access the Dashboard.
     * For **Step 7** we recommend having `docker <https://docs.docker.com/install/#supported-platforms>`__ installed (though it's not strictly mandatory).
 
+
 1. Setting the framework
 ------------------------
 

diff --git a/source/user/howto/train-model-locally.rst b/source/user/howto/train-model-locally.rst
@@ -15,6 +15,9 @@ In this tutorial we will see how to retrain a `generic image classifier <https:/
 on a custom dataset to create a `phytoplankton classifier <https://github.com/deephdc/DEEP-OC-phytoplankton-classification-tf>`__.
 If you want to follow along, you can download the toy phytoplankton dataset :fa:`download` `here <https://api.cloud.ifca.es:8080/swift/v1/public-datasets/phytoplankton-mini.zip>`__.
 
+If you are new to Machine Learning, you might want to check some
+:doc:`useful Machine Learning resources <../others/useful-ml-resources>` we compiled to help you get started.
+
 .. admonition:: Requirements
 
     * having `Docker <https://www.docker.com>`__ installed. For an up-to-date installation please follow

diff --git a/source/user/howto/train-model-remotely.rst b/source/user/howto/train-model-remotely.rst
@@ -17,6 +17,9 @@ In this tutorial we will see how to retrain a `generic image classifier <https:/
 on a custom dataset to create a `phytoplankton classifier <https://github.com/deephdc/DEEP-OC-phytoplankton-classification-tf>`__.
 If you want to follow along, you can download the toy phytoplankton dataset :fa:`download` `here <https://api.cloud.ifca.es:8080/swift/v1/public-datasets/phytoplankton-mini.zip>`__.
 
+If you are new to Machine Learning, you might want to check some
+:doc:`useful Machine Learning resources <../others/useful-ml-resources>` we compiled to help you get started.
+
 .. admonition:: Requirements
 
     * You need  a `DEEP-IAM <https://iam.deep-hybrid-datacloud.eu/>`__ account to be able to access the Dashboard and Nextcloud storage.
@@ -61,7 +64,12 @@ Again, the folder structure and their content will of course depend on the modul
 This structure is just an example in order to complete the workflow for this tutorial.
 
 Once you have prepared your data locally, you can drag your folder to the Nextcloud Web UI to upload it.
-jhdfjhdjfhhdjfhjdhfhjfjdhfjhdjfhjdhfjhdfhjdfjhdjfhjdhfjdhfjhdfjhdjhfjdhfjhjdhfjhdjfhjdfhjdhfjhdjfhjdhfjdhfjhdjfhdjfhjhfjdhdjfhjdhfjdhfjhdfjhjdfhjdfhjdfjhdjfhdjfhjdhfjdhfjhdjfhjdfjfdhjhfdjhjfhdjhfjdhjdhfjhdjfhdjfhjdfjdhfjhdfjhjdfhjhjfdhjhfjdhfjhfdjhfjdhjdhfjhfdjdhfjhdjfjhjhfdjhfdhjhfdjhfjdhfjdhfjhjdfhjdfjdfhfdjhfdjhjdhfjhfd
+
+If your dataset is on a remote machine, you will have to
+:ref:`install rclone <user/howto/rclone:Installing rclone>` on that machine,
+:ref:`configure it <user/howto/rclone:Configuring rclone>`
+and run an :ref:`rclone copy <user/howto/rclone:Using rclone>` to copy your data to Nextcloud.
+
 .. tip::
 
     Uploading to Nextcloud can be particularly slow if your dataset is composed of lots of small files.

diff --git a/source/user/index.rst b/source/user/index.rst
@@ -85,17 +85,11 @@ Develop a model (advanced user)
    Develop a model <howto/develop-model>
 
 Others
-^^^^^^
+------
 
 .. toctree::
-   :maxdepth: 1
-
-   Video demos <howto/video-demos>
-
-More
-----
-
-.. toctree::
-   :maxdepth: 1
+   :maxdepth: 2
 
-   Modules <modules/index>
+   Useful Machine Learning resources <others/useful-ml-resources>
+   Video demos <others/video-demos>
+   Modules <others/modules>
diff --git a/source/user/others/useful-ml-resources.rst b/source/user/others/useful-ml-resources.rst
@@ -1,10 +1,153 @@
 Useful Machine Learning resources
 =================================
 
-..
-    todo:
-    Add:
-    * paperswithcode
-    * dataset versioning dvc
-    * long term dataset storage: zenodo, S3
-    * DL tutorials
+This page offers advice on tools that address common problems
+that users who are not ML experts might face.
+
+
+Tutorials
+---------
+
+Here are some basic resources to get you started quickly in the Deep Learning / Machine Learning world.
+
+Books
+^^^^^
+
+* *Deep Learning with Python*, F. Chollet
+* *The FastAI book*
+* *Deep Learning*, I. Goodfellow, Y. Bengio and A. Courville
+
+
+Courses
+^^^^^^^
+
+* `Google Machine Learning Crash Course <https://developers.google.com/machine-learning/crash-course>`__
+* `Machine Learning Mastery <https://machinelearningmastery.com/start-here/>`__
+* `DAIR ML YouTube Courses <https://github.com/dair-ai/ML-YouTube-Courses>`__
+
+
+
+Datasets
+--------
+
+Dataset labeling
+^^^^^^^^^^^^^^^^
+
+Some tools to help you get started creating your dataset.
+
+* `LabelStudio <https://labelstud.io/>`__ - General annotation (text, images, etc.)
+* `LabelImg <https://github.com/tzutalin/labelImg>`__ - Image annotation
+* `refinery <https://github.com/code-kern-ai/refinery>`__ - Labeling for NLP
+* `superintendent <https://github.com/janfreyberg/superintendent>`__ - ipywidget-based interactive labelling tool for your data.
+* `Roboflow <https://roboflow.com/annotate>`__ - only free if your dataset is public
+* `Labelbox <https://labelbox.com/>`__ - paid tool
+
+
+Find a dataset
+^^^^^^^^^^^^^^
+
+If you don't have any data, try to find an open dataset that suits you.
+
+* `Google Dataset search <https://datasetsearch.research.google.com/>`__
+* `Graviti Open Datasets <https://gas.graviti.com/open-datasets>`__
+* `DataHub <https://datahub.io/collections>`__
+* `Kaggle <https://www.kaggle.com/>`__
+* `Papers with Code Datasets <https://paperswithcode.com/datasets>`__
+
+
+Explore your dataset
+^^^^^^^^^^^^^^^^^^^^
+
+Let's make sure the dataset does not contain errors.
+
+* `Google's Know your data <https://knowyourdata.withgoogle.com/>`__ - only valid for common TensorFlow Datasets
+* `Sweetviz <https://github.com/fbdesignpro/sweetviz>`__ - explore and compare tabular data
+* `cleanlab <https://github.com/cleanlab/cleanlab>`__ - dataset cleaning
+* `FastDup <https://github.com/visualdatabase/fastdup>`__ - dataset cleaning. Find anomalies, duplicate and near duplicate images, clusters of similarity, broken images, image statistics, wrong labels.
+* `deepchecks <https://github.com/deepchecks/deepchecks>`__ - checks related to various types of issues, such as model performance, data integrity, distribution mismatches, and more.
+* `kangas <https://github.com/comet-ml/kangas>`__ - exploring, analyzing, and visualizing large-scale multimedia data
+* `Impyute <https://github.com/eltonlaw/impyute>`__ - missing data imputation
+
+
+
+Feature selection
+^^^^^^^^^^^^^^^^^
+
+Sometimes less is more. Learn how to select the appropriate features of your dataset.
+
+* `sklearn - feature selection <https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection>`__
+* `mlxtend <https://rasbt.github.io/mlxtend/>`__
+
+
+Imbalanced learning
+^^^^^^^^^^^^^^^^^^^
+
+Do you have too much data from one class and too little from others? Let's balance things out!
+
+* `imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`__
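As a rough sketch of what balancing means, the snippet below performs random oversampling with the standard library only: minority classes are padded with randomly repeated samples until every class matches the largest one. The helper ``random_oversample`` and the toy labels are hypothetical; imbalanced-learn's ``RandomOverSampler`` is the robust counterpart, and it also offers more elaborate strategies.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes have equal size.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels


samples = ["a1", "a2", "a3", "a4", "b1"]
labels = ["A", "A", "A", "A", "B"]

new_samples, new_labels = random_oversample(samples, labels)
balanced_counts = Counter(new_labels)
print(balanced_counts)
```

Oversampling should only be applied to the training split; duplicating samples before splitting leaks copies into the validation set.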
+
+
+Data augmentation
+^^^^^^^^^^^^^^^^^
+
+Do you have little data? Make the most of it!
+
+* `Augly <https://github.com/facebookresearch/AugLy>`__ - General augmentation (text, images, etc.)
+* `imgaug <https://github.com/aleju/imgaug>`__ - Image augmentation
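The idea behind image augmentation can be sketched without any library: each transformation produces a new, label-preserving variant of an existing image. The example below represents an image as a nested list of pixel values, which is an assumption made to keep the sketch self-contained; the libraries above work on real image arrays and provide many more transformations.

```python
def flip_horizontal(image):
    """Mirror the image left to right (reverse every row)."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom (reverse the row order)."""
    return [list(row) for row in reversed(image)]


# Toy 2x3 "image" with one channel.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

# One original sample becomes three training samples.
augmented = [image, flip_horizontal(image), flip_vertical(image)]
for variant in augmented:
    print(variant)
```

Only use transformations that keep the label valid: a horizontal flip is usually harmless for plankton images, but it would change the meaning of handwritten digits or text.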
+
+
+Dataset shift
+^^^^^^^^^^^^^
+
+Is your data likely to degrade over time (e.g. the camera gets dirty)? Keep an eye on it!
+
+* `Alibi-detect <https://github.com/SeldonIO/alibi-detect>`__
+* `Avalanche <https://github.com/ContinualAI/avalanche>`__ - Continual Learning library based on PyTorch
+* `River <https://github.com/online-ml/river>`__ - Online learning
+* `Frouros <https://github.com/IFCA/frouros>`__ - drift detection library
+* `Cinnamon <https://github.com/zelros/cinnamon>`__
+* `Eurybia <https://github.com/MAIF/eurybia>`__
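A minimal sketch of how drift in a single numeric feature can be detected: compare the values seen at training time with recent values using the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical distribution functions. The function name, the toy values and the ``0.5`` alert threshold are assumptions for illustration; the libraries above implement this and other detectors together with proper significance tests.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)

    def ecdf(sorted_values, x):
        # Fraction of values less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The largest gap between two step functions occurs at a sample point.
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)


reference = [0.1, 0.2, 0.3, 0.4, 0.5]     # feature values at training time
similar = [0.15, 0.25, 0.35, 0.45, 0.55]  # recent values, same distribution
shifted = [1.1, 1.2, 1.3, 1.4, 1.5]       # recent values after drift

THRESHOLD = 0.5  # hypothetical alert threshold, tune for your data
for name, batch in [("similar", similar), ("shifted", shifted)]:
    stat = ks_statistic(reference, batch)
    print(name, round(stat, 3), "drift!" if stat > THRESHOLD else "ok")
```

With samples this small the statistic is noisy; in practice, compare larger windows and rely on a p-value rather than a fixed threshold.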
+
+
+Models
+------
+
+Model development
+^^^^^^^^^^^^^^^^^
+
+If you want to develop a model from scratch, don't try to be a hero!
+`Papers with Code <https://paperswithcode.com/>`__ gathers top-performing models
+for multiple tasks with their corresponding code. Reuse them for your use cases! Try not to look
+for the top model but for the one with the cleanest code.
+
+
+Training monitoring
+^^^^^^^^^^^^^^^^^^^
+
+Let's keep an eye on the training status.
+
+* `TensorBoard <https://github.com/tensorflow/tensorboard>`__ - built for TensorFlow; PyTorch can also write TensorBoard logs via ``torch.utils.tensorboard``
+* `TensorboardX <https://github.com/lanpa/tensorboardX>`__ - framework agnostic
+* `LabML <https://github.com/labmlai/labml>`__
+
+
+Training debugging
+^^^^^^^^^^^^^^^^^^
+
+Is your training failing for some reason?
+
+* `Netron <https://github.com/lutzroeder/netron>`__ - visualize DL models 
+* `Cockpit <https://github.com/f-dangel/cockpit>`__ - debug training
+
+
+Model optimization
+^^^^^^^^^^^^^^^^^^
+
+Do you need your model to go faster?
+
+* `VoltaML <https://github.com/VoltaML/voltaML>`__ - accelerate ML models with a single line of code
+* `sparse-ml <https://github.com/neuralmagic/sparseml>`__
+* `deep-sparse <https://github.com/neuralmagic/deepsparse>`__
+* `PyTorch quantization <https://pytorch.org/docs/stable/quantization.html>`__
+* `AITemplate <https://github.com/facebookincubator/AITemplate>`__ - transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving
+* `Hummingbird <https://github.com/microsoft/hummingbird>`__ - transforms traditional ML models (e.g. Random Forest) into neural networks to benefit from hardware acceleration