# Intro to Machine Learning

## An oft-quoted definition
> A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. 

> (Mitchell 1997)

Example Experiences: Supervised and Unsupervised learning

Example Tasks: Classification, Regression, Clustering

Example Performance: Accuracy, F1-Score
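To make the definition concrete, here is a minimal sketch, assuming scikit-learn is installed. The task T is classifying iris species, the experience E is a growing number of labelled training examples, and the performance P is accuracy and macro-averaged F1 on a held-out test set. The dataset, model, and training sizes are illustrative choices, not part of Mitchell's definition.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Task T: classify iris flowers into three species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)


def evaluate(n_examples):
    """Train on the first n_examples training rows, then score on the test set."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n_examples], y_train[:n_examples])
    y_pred = model.predict(X_test)
    # Performance P: accuracy and macro-averaged F1.
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average="macro")
    return acc, f1


# Experience E: progressively more labelled examples.
for n in (20, 50, len(X_train)):
    acc, f1 = evaluate(n)
    print(f"n={n:3d}  accuracy={acc:.3f}  macro-F1={f1:.3f}")
```

On most splits the scores rise or plateau as `n` grows, though nothing guarantees a strict improvement on every run.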


## Things you can do with scikit-learn
[![ml-map](src/img/ml_map.png)](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

For a full list, check out the [User Guide](http://scikit-learn.org/stable/user_guide.html).

## Further motivation
![algo-comp](src/img/Model_comparison.jpg)
_Olson 2017 https://arxiv.org/abs/1708.05070_


### Example algorithms:
* Linear regression, logistic regression
* KNN
* SVMs
* Ensemble decision tree methods: Random Forests, Gradient boosted decision trees
    * boosting vs bagging
    * see the docs: http://scikit-learn.org/stable/modules/ensemble.html
* Naive Bayes (Gaussian, Multinomial)
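As a rough illustration of the bagging vs boosting distinction above, the sketch below cross-validates a random forest (bagging: trees fit independently on bootstrap samples, then averaged) and gradient boosted trees (boosting: trees fit sequentially, each correcting the errors of the ensemble so far). The dataset and hyperparameters are arbitrary choices for demonstration, and the comparison says nothing general about which method is better.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Bagging: independent trees on bootstrap samples, predictions averaged.
    "random forest (bagging)": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    # Boosting: shallow trees added one at a time to reduce remaining error.
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=100, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```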



# The whole process

## Common API

* Integrates well with other packages, e.g. scipy sparse matrices (CSR, CSC), pandas DataFrames, and visualization with matplotlib and seaborn
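A small sketch of what the common API looks like in practice, assuming numpy, scipy, pandas, and scikit-learn are installed: different estimators expose the same `fit` / `predict` / `score` methods, and the same calls accept a pandas DataFrame or a scipy CSR matrix. The synthetic data and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: the label depends on the first two features only.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# The same features as a DataFrame and as a sparse CSR matrix.
X_df = pd.DataFrame(X, columns=["a", "b", "c", "d"])
X_csr = sparse.csr_matrix(X)

results = []
for estimator in (LogisticRegression(), KNeighborsClassifier(n_neighbors=5)):
    for container_name, data in (("DataFrame", X_df), ("CSR matrix", X_csr)):
        estimator.fit(data, y)            # same call for every estimator
        preds = estimator.predict(data)   # same call for every input container
        accuracy = estimator.score(data, y)  # training accuracy, for illustration only
        results.append((type(estimator).__name__, container_name, preds, accuracy))
        print(f"{type(estimator).__name__:22s} {container_name:10s} {accuracy:.3f}")
```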

### Bias-Variance tradeoff
![bias-var](src/img/bias-variance.png)
_from http://www.brnt.eu/phd/node14.html_

Also see Jake VanderPlas's example chapter on [hyperparameters and model validation](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html).
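The tradeoff can be sketched numerically with scikit-learn's `validation_curve`, in the spirit of the chapter linked above. The example fits polynomial regressions of increasing degree to noisy samples of a sine curve. The data-generating function, noise level, and degree range are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of one period of a sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=60)

# Model complexity is controlled by the polynomial degree.
degrees = list(range(1, 13))
model = make_pipeline(PolynomialFeatures(), LinearRegression())

# R^2 on the training folds and on the held-out folds for each degree.
train_scores, val_scores = validation_curve(
    model,
    X,
    y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
)
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)

for degree, tr, va in zip(degrees, mean_train, mean_val):
    print(f"degree={degree:2d}  train R^2={tr:.3f}  validation R^2={va:.3f}")

best_degree = degrees[int(np.argmax(mean_val))]
print("degree with the best validation score:", best_degree)
```

Low degrees underfit (high bias: both scores are poor), while the training score keeps improving as the degree grows. A widening gap between training and validation scores at high degree is the usual sign of high variance.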

# Other Resources

## Machine learning with scikit-learn
Besides the tutorials and examples in scikit-learn's documentation, I'd recommend:
* Jake VanderPlas's book, [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#5.-Machine-Learning). All of the notebooks are also available through [Binder](https://mybinder.org/v2/gh/jakevdp/PythonDataScienceHandbook/master?filepath=notebooks%2FIndex.ipynb)
* Browsing Kaggle [kernels](https://www.kaggle.com/kernels)

## Books
* Hastie, Tibshirani, and Friedman, _The Elements of Statistical Learning_
* Bishop, _Pattern Recognition and Machine Learning_
* Murphy, _Machine Learning: A Probabilistic Perspective_

## Hyperparameter optimization and AutoML
* AutoML packages: [TPOT](https://github.com/rhiever/tpot) and the [AutoML](https://github.com/automl) group's packages, such as [auto-sklearn](https://github.com/automl/auto-sklearn). These packages use genetic and Bayesian optimization algorithms to estimate the relationship between hyperparameter settings and model performance (the "fitness"). The search balances exploring regions where that relationship is uncertain against focusing on subspaces that already perform well. They can optimize not only the hyperparameters but also the type of model and the preprocessing steps.
* Bayesian optimization packages: [hyperopt](https://github.com/hyperopt/hyperopt), [Spearmint](https://github.com/HIPS/Spearmint), or [MOE](https://github.com/Yelp/MOE)
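Each of the packages above has its own interface, which is not shown here. As a simpler baseline that ships with scikit-learn itself, the sketch below uses plain random search (`RandomizedSearchCV`), which is not Bayesian or genetic optimization: it samples hyperparameter settings independently and keeps the one with the best cross-validated score. The model, search ranges, and budget are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Distributions to sample each hyperparameter from (upper bounds exclusive).
param_distributions = {
    "n_estimators": randint(10, 100),
    "max_depth": randint(1, 10),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,       # number of sampled settings to evaluate
    cv=3,            # 3-fold cross-validation per setting
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Smarter optimizers aim to reach a comparable or better score with fewer evaluations by using earlier results to choose the next setting to try.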

## Other ML libraries in Python
* [XGBoost](https://github.com/dmlc/xgboost) or [LightGBM](https://github.com/Microsoft/LightGBM) for gradient boosting
* MLlib for Spark

## Deep learning
* TensorFlow, PyTorch, MXNet