# Intro to Machine Learning with scikit-learn

## An oft-quoted definition
> A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. 

> (Mitchell 1997)

Example Experiences: Supervised and Unsupervised learning

Example Tasks: Classification, Regression, Clustering

Example Performance: Accuracy, F1-Score, RMSE
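
The definition can be made concrete with a small sketch. The specifics are assumptions chosen for illustration, not part of Mitchell's definition: the task T is digit classification on scikit-learn's bundled `load_digits` data, the experience E is a growing number of labelled training samples, and the performance measure P is accuracy on a fixed held-out set.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Task T: classify handwritten digits (dataset bundled with scikit-learn).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Experience E: progressively larger labelled training subsets.
scores = {}
for n_samples in (50, 200, len(X_train)):
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train[:n_samples], y_train[:n_samples])
    # Performance P: accuracy on the same held-out test set every time.
    scores[n_samples] = accuracy_score(y_test, model.predict(X_test))

for n_samples, acc in scores.items():
    print(f"{n_samples:5d} training samples -> accuracy {acc:.3f}")
```

If the program "learns" in Mitchell's sense, the printed accuracy should tend to rise as the number of training samples grows.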


## Things you can do with scikit-learn
[![ml-map](src/img/ml_map.png)](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

For a full list, check out the [User Guide](http://scikit-learn.org/stable/user_guide.html).

### Example algorithms:
* Linear regression (ordinary least squares in `LinearRegression`, or trained with stochastic gradient descent in `SGDRegressor`), logistic regression
* KNN
* SVMs
* Ensemble decision tree methods: Random Forests, Gradient boosted decision trees
    * boosting vs bagging
    * see the docs: http://scikit-learn.org/stable/modules/ensemble.html
* Naive Bayes (Gaussian, Multinomial)
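
As a rough sketch of the bagging vs boosting distinction mentioned above, the snippet below cross-validates a random forest and a gradient boosted ensemble on the same data. The dataset (`load_breast_cancer`) and the settings (`n_estimators=100`, 5-fold CV) are assumptions picked for illustration; the scores say nothing general about which method is better.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ensembles = {
    # Bagging-style: trees are fit independently on bootstrap samples
    # and their predictions are averaged.
    "random forest (bagging)": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    # Boosting: shallow trees are fit one after another, each one
    # correcting the errors of the ensemble built so far.
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=100, random_state=0
    ),
}

cv_accuracy = {}
for name, estimator in ensembles.items():
    # Mean accuracy over 5 cross-validation folds.
    cv_accuracy[name] = cross_val_score(estimator, X, y, cv=5).mean()
    print(f"{name}: {cv_accuracy[name]:.3f}")
```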

## Further motivation
![algo-comp](src/img/Model_comparison.jpg)
_Olson 2017 https://arxiv.org/abs/1708.05070_

### In Neuroscience
#### From Konrad Kording's group
* Encoding: [Modern machine learning outperforms GLMs at predicting spikes](https://www.biorxiv.org/content/early/2017/10/04/111450), with [code](https://github.com/KordingLab/spykesML)
![Fig4](src/img/ML_GLM_Fig4.png)
* Decoding: [Machine learning for neural decoding](https://arxiv.org/abs/1708.00909), with [code](https://github.com/KordingLab/Neural_Decoding)

#### Neural Networks
* RNN (LFADS, Sussillo) and CNN (Yamins, Ecker) papers that help explain neural responses are also emerging

### Examples of ensemble methods in biology
* ensemble methods are useful not only for Kaggle competitions but also in biology
* bagging, boosting, and stacking
* "Wisdom of crowds for robust gene network inference"
     > Reconstructing gene regulatory networks from high-throughput data is a long-standing challenge. Through the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we performed a comprehensive blind assessment of over 30 network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae and in silico microarray data. We characterize the performance, data requirements and inherent biases of different inference approaches, and we provide guidelines for algorithm application and development. We observed that no single inference method performs optimally across all data sets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse data sets. We thereby constructed high-confidence networks for E. coli and S. aureus, each comprising ∼1,700 transcriptional interactions at a precision of ∼50%. We experimentally tested 53 previously unobserved regulatory interactions in E. coli, of which 23 (43%) were supported. Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks.

    > from https://www.nature.com/articles/nmeth.2016
* Other DREAM competitions, e.g. Keller et al. 2017 [Predicting human olfactory perception from chemical features of odor molecules](http://science.sciencemag.org/content/355/6327/820.long)
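
The idea of integrating predictions from several methods can be sketched in scikit-learn with `VotingClassifier` and `StackingClassifier`. This is not the procedure used in the DREAM papers; it is only a generic illustration, and the dataset (`load_wine`) and the three base models are arbitrary assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Three deliberately different base models.
base_models = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

combined = {
    # Voting: average the predicted class probabilities of the base models.
    "voting": VotingClassifier(estimators=base_models, voting="soft"),
    # Stacking: a final model learns how to weight the base models'
    # out-of-fold predictions.
    "stacking": StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
}

ensemble_accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in combined.items()
}
for name, acc in ensemble_accuracy.items():
    print(f"{name}: {acc:.3f}")
```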


# The whole process

1. Preprocess your data; transform and engineer features
    * [sklearn.preprocessing](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
    * remove outliers and impute NaNs
    * standardize to zero mean and unit variance
    * convert categorical variables to numerical (e.g. one-hot encoding to binary indicator columns)
    * log-transform
    * expanded or reduced bases, see [sklearn.decomposition](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition)
    * polynomial, interaction terms
    * also see [sklearn.feature_selection](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)
3. Divide data into training, validation, and test sets
    * splitter classes of [sklearn.model_selection](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)
    * `KFold`, `LeaveOneOut`
4. Tune model and regularization parameters
    * search over relevant models and hyperparameters
    * hyper-parameter-optimizers in [sklearn.model_selection](http://scikit-learn.org/stable/modules/classes.html#hyper-parameter-optimizers)
        * `GridSearchCV` and `RandomizedSearchCV`
        * Random sampling often finds good settings in fewer trials than a grid, especially when only a few hyperparameters matter ![rand](src/img/Random_opt.png)
        * See Bergstra and Bengio 2012 [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)
    * Fit model to training data, using validation set to assess model [score](http://scikit-learn.org/stable/modules/model_evaluation.html)
    * `model.fit()`
5. Evaluate on held-out test data
    * [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics)
    * `scorer(y_true, model.predict(X_test))`
    * evaluate predictions: confusion matrix, residuals
    * examine best model: coefficients (`coef_`), feature importances (`feature_importances_`)
6. Repeat (in the case of nested cross-validation)
    * test new models, add or remove features
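
The steps above can be strung together in one short sketch. Everything specific here is an assumption made for illustration: the dataset (`load_breast_cancer`), the artificially inserted NaNs, the choice of logistic regression, and the search range for `C`. Putting the preprocessing inside a `Pipeline` means it is re-fit on each training fold during the search, so no information from the validation folds leaks into it.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Knock out ~2% of values so the imputation step has work to do
# (illustrative only; real data would bring its own NaNs).
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.02] = np.nan

# Hold out a test set before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and model in a single estimator.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Randomly sample the regularization strength; 5-fold CV on the
# training data plays the role of the validation set.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={"clf__C": loguniform(1e-3, 1e3)},
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refit best model once on the held-out test data.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_confusion = confusion_matrix(y_test, y_pred)
coefficients = search.best_estimator_.named_steps["clf"].coef_

print("best parameters:", search.best_params_)
print("test accuracy:", round(test_accuracy, 3))
print(test_confusion)
```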

## Common API
* initialize, fit, predict, score:
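```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


# Thin wrapper class illustrating the common scikit-learn estimator API
class SklearnHelper:
    def __init__(self, clf, scorer, seed=42, params=None):
        # Copy params so neither the caller's dict nor a shared default is mutated
        params = dict(params or {})
        # Assumes the estimator accepts random_state (some, e.g. KNeighborsClassifier, do not)
        params['random_state'] = seed
        self.clf = clf(**params)
        self.scorer = scorer

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x_test):
        return self.clf.predict(x_test)

    def score(self, x_test, y_true):
        return self.scorer(y_true, self.predict(x_test))


# Example usage (the dataset is a stand-in; substitute your own X and y):
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=42)
model = SklearnHelper(LogisticRegression, accuracy_score, params={'max_iter': 5000})
model.train(X_train, Y_train)
print(model.score(X_test, Y_test))
```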
* For an overview of the API see [API design for machine learning software: experiences from the scikit-learn project](https://arxiv.org/pdf/1309.0238.pdf), or check the docs for the [full API](http://scikit-learn.org/stable/modules/classes.html#)

* Integrates well with other packages, e.g. SciPy sparse matrices (CSR, CSC), pandas DataFrames, and visualization with matplotlib and seaborn
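
A minimal sketch of that integration, assuming pandas and SciPy are installed. The DataFrame is toy data invented for this example (the column names `firing_rate` and `region` are hypothetical), and `ColumnTransformer` is one way, not the only way, to apply different preprocessing to different columns.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented purely for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "firing_rate": rng.normal(10.0, 3.0, size=n),
    "region": rng.choice(["V1", "M1", "PFC"], size=n),
})
# The label depends on both columns so there is something to learn.
target = ((df["firing_rate"] > 10.0) | (df["region"] == "M1")).astype(int)

# Columns are selected by name: scale the numeric one,
# one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["firing_rate"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

# The pipeline takes the DataFrame directly.
model = make_pipeline(preprocess, LogisticRegression())
model.fit(df, target)
train_accuracy = model.score(df, target)

# Many estimators also accept SciPy sparse matrices such as CSR.
X_sparse = sparse.csr_matrix(preprocess.fit_transform(df))
sparse_model = LogisticRegression().fit(X_sparse, target)
sparse_accuracy = sparse_model.score(X_sparse, target)

print(train_accuracy, sparse_accuracy)
```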

### Bias-Variance tradeoff
![bias-var](src/img/bias-variance.png)
_from http://www.brnt.eu/phd/node14.html_

Also see the example chapter from Jake VanderPlas [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html)
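
One way to see the tradeoff numerically is `validation_curve`, which scores a model on training and validation folds while a complexity parameter is varied. The sketch below is an assumption-laden illustration (decision tree depth as the complexity knob, `load_breast_cancer` as the data), not a reproduction of the figure above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Model complexity: maximum depth of the tree.
depths = np.arange(1, 11)

# Both returned arrays have shape (n_depths, n_folds).
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)

for depth, tr, va in zip(depths, mean_train, mean_valid):
    print(f"max_depth={depth:2d}  train={tr:.3f}  validation={va:.3f}")
```

Shallow trees tend to score poorly on both sets (high bias). As depth grows, the training score keeps climbing while the validation score levels off or drops, and that widening gap is the signature of high variance.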

# Other Resources

## Machine learning (with scikit-learn)
Besides checking out the tutorials and examples that are part of scikit-learn's documentation, I'd recommend:
* Jake VanderPlas's book, [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#5.-Machine-Learning). All of the notebooks are also available through [Binder](https://mybinder.org/v2/gh/jakevdp/PythonDataScienceHandbook/master?filepath=notebooks%2FIndex.ipynb)
* Browsing Kaggle [kernels](https://www.kaggle.com/kernels)
* Nature Methods Points of Significance columns: http://mkweb.bcgsc.ca/pointsofsignificance/

## Books
* Hastie, Elements of Statistical Learning
* Bishop, Pattern Recognition and Machine Learning
* Murphy, Machine Learning: a Probabilistic Perspective

## Hyperparameter optimization and AutoML
* AutoML packages: [TPOT](https://github.com/rhiever/tpot), the [AutoML](https://github.com/automl) packages like [auto-sklearn](https://github.com/automl/auto-sklearn). These packages use genetic and Bayesian optimization algorithms to model the relationship between hyperparameter settings and model performance (the "fitness"). The search balances exploring regions where that relationship is uncertain against focusing on subspaces that perform well. They can optimize not only the hyperparameters but also the type of model and the preprocessing steps.
* Bayesian optimization packages: [hyperopt](https://github.com/hyperopt/hyperopt), [Spearmint](https://github.com/HIPS/Spearmint), or [MOE](https://github.com/Yelp/MOE)

## Other ML libraries in python
* [XGBoost](https://github.com/dmlc/xgboost) or [LightGBM](https://github.com/Microsoft/LightGBM) for gradient boosting. These packages also have scikit-learn wrappers, so you can use them with `GridSearchCV` and in pipelines with other sklearn algorithms.
* MLlib for Spark
* [scikit-learn-contrib](https://github.com/scikit-learn-contrib)
* feature selection algorithms, such as [scikit-rebate](https://github.com/EpistasisLab/scikit-rebate)

## Deep learning
* TensorFlow, PyTorch, MXNet