# Welcome to the Material Science Machine Learning Jupyter Notebooks!

Here you will get an introduction to Material Science and Machine Learning using Python.

**These examples will most likely only work with Python 3.6+.**

There are many machine learning algorithms and techniques, but the workflow is **always the same**:

1. [gather the data](1-gather-data.ipynb) (web scraping, experiments, simulations)
2. [explore the data](2-explore-sanitize-data.ipynb)
  - get a feel for what data you have
  - are there any interesting features to explore?
3. [sanitize the data](2-explore-sanitize-data.ipynb)
  - how do you handle missing data?
  - how do you handle categorical data? (for example 'metal', 'non-metal', 'spacegroup')
4. [apply machine learning algorithms](3-predicting-data.ipynb)
  - often the easiest part
5. [validate predictions](3-predicting-data.ipynb)
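
The two questions under step 3 can be sketched with pandas. This is only a minimal illustration, not necessarily what the notebooks do: the table, its column names and values, and the choice of median filling and one-hot encoding are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; column names and values are made up for illustration.
df = pd.DataFrame({
    "density": [7.87, np.nan, 2.33, 8.96],
    "category": ["metal", "non-metal", "non-metal", "metal"],
})

# Missing data: fill gaps with the column median (dropping the rows is another option).
df["density"] = df["density"].fillna(df["density"].median())

# Categorical data: one-hot encode so each category becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["category"])
print(encoded)
```

Which strategy is appropriate depends on the data: filling with the median hides the fact that a value was missing, which may or may not matter for your problem.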

# How is Machine Learning different from Statistics?

This is a huge generalization, but:

 - statisticians: care about understanding how the data is generated, and about understanding the model and its parameters
 - machine learning practitioners: mostly care about predictive ability

I feel that scientists fall mostly in the statisticians' camp.

Occam's razor: one should select the simplest model that describes the data.

As an example, see https://www.youtube.com/watch?v=1A1yaWS8gSg

A sophisticated model that predicts planet positions with circles can be replaced by a far simpler one that uses ellipses. This comes from our **understanding** of the physics.

# Which Machine Learning Algorithms should I use?

There are hundreds of algorithms to choose from. Always start with the simplest so that you can judge how more complex models perform.
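
As a rough sketch of what "start with the simplest" can look like in scikit-learn, the snippet below compares a baseline that ignores the features entirely with a simple linear classifier. The iris dataset, the two models, and 5-fold scoring are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Baseline: always predicts the most frequent class and ignores the features.
baseline = DummyClassifier(strategy="most_frequent")
baseline_score = cross_val_score(baseline, X, y, cv=5).mean()

# A simple linear model to compare against the baseline.
simple = LogisticRegression(max_iter=1000)
simple_score = cross_val_score(simple, X, y, cv=5).mean()

print(f"baseline accuracy: {baseline_score:.2f}")
print(f"logistic regression accuracy: {simple_score:.2f}")
```

Any more complex model then has to beat the simple model's score by enough to justify its extra complexity.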

The general fields of machine learning are listed below. You will notice that some algorithms appear in multiple areas.

## Classification

SVM, nearest neighbors, random forests, gradient boosting, neural networks.

A great [starting example dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)

![Classification](http://scikit-learn.org/stable/_images/sphx_glr_plot_classification_thumb.png)

## Regression

SVR, ridge regression, Lasso, **Bayesian methods**, neural networks

![linear regression](http://scikit-learn.org/stable/_images/sphx_glr_plot_ols_0011.png)
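
To get a feel for how two of the listed regressors differ, here is a small sketch that fits ridge regression and Lasso to the same data. The synthetic dataset and the `alpha` values are assumptions picked for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for this sketch): only the first two of five
# features influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks all coefficients; Lasso can set irrelevant ones exactly to zero.
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
```

Because Lasso can zero out coefficients, it is sometimes used as a rough feature selection step as well as a regressor.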

I would like to highlight how awesome Bayesian methods are. [pymc3](http://docs.pymc.io/) is the Python package to use. If you can create a model that describes your data, you can use Bayesian methods. If not, Gaussian processes are amazing (they are "parameter free" fitting methods).

An example Gaussian process fit is shown below. Notice how you get the variance of the prediction along with the fit to your data.

![Gaussian Process](https://blog.dominodatalab.com/wp-content/uploads/2017/03/output_57_0-1.png)
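
A minimal sketch of getting a prediction together with its uncertainty, using scikit-learn's `GaussianProcessRegressor` (swapped in here for convenience; it is not the library used to make the figure). The training data, the RBF kernel, and the `alpha` value are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic training data (an assumption for this sketch): a few samples of sin(x).
X_train = np.linspace(0.0, 5.0, 8).reshape(-1, 1)
y_train = np.sin(X_train).ravel()

# alpha adds a small value to the kernel diagonal for numerical stability.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6, random_state=0)
gp.fit(X_train, y_train)

# One query point inside the training range and one far outside it.
X_query = np.array([[2.5], [10.0]])
mean, std = gp.predict(X_query, return_std=True)

for x, m, s in zip(X_query.ravel(), mean, std):
    print(f"x = {x:4.1f}: prediction = {m:6.3f} +/- {s:.3f}")
```

The standard deviation should come out much larger for the point far from the training data, which is the model telling you it is extrapolating.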

Bayesian methods predicting the effect of regulation on coal miner deaths:

![coal miner deaths](http://docs.pymc.io/_images/notebooks_getting_started_52_0.png)

I am not very knowledgeable about neural networks, but [PyTorch](https://pytorch.org/) is the most user-friendly way to get started.

Play with neural networks in your browser to get a feel for them. [link](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.46804&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)


## Clustering

k-Means, spectral clustering, mean-shift

![effect of cluster size etc](http://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_assumptions_001.png)
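
A short k-means sketch with scikit-learn. The synthetic blobs and the choice of three clusters are assumptions for the example; with real data the number of clusters is usually not known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for this sketch): points drawn around three centers.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# k-means needs the number of clusters up front; here we happen to know it is 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_)
```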

# Bias Variance Tradeoff

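 - bias error: error from a model that is too simple to capture the pattern in the data (it shows up as a high error even on the training data)
 - variance error: error from a model that is too sensitive to its training data (the fit changes a lot when trained on a different set of training data)
 - irreducible error: error that cannot be reduced regardless of algorithm (typically noise in the data)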

![bias variance](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)
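
The tradeoff can be reproduced with a few lines of NumPy by fitting polynomials of increasing degree to noisy data. The target function, noise level, sample sizes, and degrees below are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (an assumption for this sketch): noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(20)
x_test, y_test = sample(200)

train_errors, test_errors = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means a more flexible model.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_errors[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_errors[degree]:.3f}, "
          f"test MSE = {test_errors[degree]:.3f}")
```

The training error keeps falling as the degree increases, while the error on unseen data typically stops improving once the model starts fitting the noise.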

# How do we tell where we are on the bias variance curve?

[Cross Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)): split your data into a training and test set. Use the training set to fit your model. Use the test set to evaluate the performance of your model.

Often you split your data 90% training, 10% testing.

scikit-learn (`sklearn`) provides many methods for automating this.
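
Two of those scikit-learn helpers are sketched below: `train_test_split` for a single 90% / 10% split and `cross_val_score` for repeated splits. The iris dataset and the nearest neighbors model are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Single 90% / 10% split: fit on the training part, score on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
model.fit(X_train, y_train)
print(f"hold-out accuracy: {model.score(X_test, y_test):.2f}")

# 10-fold cross-validation: rotate the held-out part so every sample is tested once.
scores = cross_val_score(model, X, y, cv=10)
print(f"cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

With a small dataset a single 10% test set contains very few samples, so the cross-validated score is usually the more trustworthy of the two numbers.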


# Additional Resources

 - [great introduction](https://docs.google.com/presentation/d/1O6ozzZHHxGzU-McpvEG09hl7K6oQDd2Taw0FOlnxJc8/edit?usp=docslist_api)
 - [Kaggle](https://www.kaggle.com/) competitions that teach you how to use machine learning (best way to learn is to apply)
 - [fast.ai](https://www.fast.ai): the place to learn about neural networks
 - Coursera, edX, Udacity: too many courses to name