# Machine Learning and Statistics for Physicists

Material for a [UC Irvine](https://uci.edu/) course offered by the [Department of Physics and Astronomy](https://www.physics.uci.edu/).

Content is maintained on [GitHub](https://github.com/dkirkby/MachineLearningStatistics) and distributed under a [BSD3 license](https://opensource.org/licenses/BSD-3-Clause).

[Table of Contents](Contents.ipynb)

## Introduction

**ACTIVITY:** Discuss these questions:
1. What is a *data scientist*?
2. What is the relationship between *machine learning* and *statistics*?
3. What is "deep" about *deep learning*?
4. Does your research focus more on *data* or *models*?

![Data-models-statistics triangle](img/Intro/MLS-triangle.png)

Further reading:
- [Data mining and statistics: what's the connection?](http://statweb.stanford.edu/~jhf/ftp/dm-stat.pdf)
- [The rise of the "data engineer"](https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603)
- [Humorous contrasts between ML and Stats](http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf)
    - Python $\leftrightarrow$ R
    - conference talk $\leftrightarrow$ journal article

### How will this course be different from a CS class?

Physics and astronomy students have different preparation:
- Strong background and experience with the mathematical tools (linear algebra, multivariate calculus) needed for a rigorous discussion of statistics.
- Weak or varied background in traditional core CS topics such as fundamental algorithms, databases, etc.

Physics and astronomy research also has different needs:
- Our data and models are often fundamentally different from those in typical CS contexts.
- We ask different types of questions about our data, sometimes requiring new methods.
- We have different priorities for judging a "good" method: interpretability, error estimates, etc.

### What is Data?

Data are a finite set of measurements:
- Usually viewed as a 2D table, e.g. a spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [ROOT tree](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...
- **columns = features**: numeric / categorical?
- **rows = samples**: ordered? independent? identically distributed? (i.i.d.)
- measurement errors?
- binned / un-binned?
- similarity measure on samples?
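
The table view above can be sketched with a small NumPy array. This is only an illustration under stated assumptions: the feature names `flux`, `redshift` and `obj_type` and all of the values are hypothetical, and a Pandas dataframe or FITS table would serve equally well.

```python
import numpy as np

# Hypothetical toy dataset: 5 samples (rows), each with 3 features (columns).
rng = np.random.default_rng(seed=123)
n_samples = 5

# Two numeric features with made-up distributions.
flux = rng.normal(loc=10.0, scale=2.0, size=n_samples)
redshift = rng.uniform(0.0, 3.0, size=n_samples)

# One categorical feature, stored as labels and then encoded as integer codes.
obj_type = np.array(['star', 'galaxy', 'galaxy', 'quasar', 'star'])
categories, codes = np.unique(obj_type, return_inverse=True)

# Stack the features as columns so that the shape is (n_samples, n_features).
X = np.column_stack([flux, redshift, codes])

print(X.shape)      # (5, 3)
print(categories)   # ['galaxy' 'quasar' 'star']
print(codes)        # [2 0 0 1 2]
```

The integer codes are only labels: their numerical order carries no meaning, which is one reason the numeric / categorical distinction matters when choosing a similarity measure.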

**ACTIVITY:** Pick one of these ML problems and describe the rows (samples) and columns (features) of the data you might use to solve the problem.
1. Learn a fast approximation to a slow exact calculation.
2. Learn to identify Higgs particle decays from LHC event data.
3. Learn to estimate the distance to a quasar using optical images.

### What is a Model?

Models specify the probabilities of possible measurements:
- Explicit: probability density function.
- Implicit: algorithm to generate random outcomes (forward / generative model).
- Usually wrong (except "Toy MC")
- Observables (and latent variables):
  - integrability: required to calculate normalized probabilities.
- Parameters (and hyper-parameters):
  - differentiability: required to find most probable (uphill) direction.
- Bias-variance tradeoffs (regularization).
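
As a rough sketch of the distinctions above, the following uses a one-dimensional Gaussian as a stand-in model. The Gaussian, the function names `gauss_pdf`, `generate` and `dlogL_dmu`, and the parameter values are all assumptions made for illustration. It shows an explicit density, an implicit generator, a numerical check of the normalization over the observable, and the derivative of the log-likelihood with respect to a parameter.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Explicit model: a normalized Gaussian probability density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def generate(n, mu, sigma, rng):
    """Implicit (generative) model: draw n random outcomes."""
    return rng.normal(loc=mu, scale=sigma, size=n)

rng = np.random.default_rng(seed=1)
x = generate(1000, mu=1.5, sigma=0.5, rng=rng)

# Integrability: the density should integrate to one over the observable x.
# A simple Riemann sum on a wide grid is enough for this smooth function.
grid = np.linspace(-5.0, 8.0, 2001)
dx = grid[1] - grid[0]
norm = np.sum(gauss_pdf(grid, 1.5, 0.5)) * dx

def dlogL_dmu(x, mu, sigma):
    """Derivative of the Gaussian log-likelihood with respect to mu."""
    return np.sum(x - mu) / sigma ** 2

# Differentiability: the sign of the derivative gives the uphill direction.
grad_low = dlogL_dmu(x, mu=1.0, sigma=0.5)    # positive: increase mu
grad_high = dlogL_dmu(x, mu=2.0, sigma=0.5)   # negative: decrease mu

print(norm, grad_low, grad_high)
```

Here the integral runs over the observable and the derivative is taken with respect to the parameter, which is why the two requirements attach to different parts of the model.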

### What is special about ML in physics and astronomy?

- We are data producers, not data consumers:
  - Experiment / survey design.
  - Optimization of statistical errors.
  - Control of systematic errors.
- Our data measure physical processes:
  - Measurements often reduce to counting photons, etc., with random errors that are known a priori.
  - Dimensions and units are important.
- Our models are usually traceable to an underlying physical theory:
  - Models constrained by theory and previous observations.
  - Parameter values often intrinsically interesting.
- A parameter error estimate is just as important as its value:
  - Prefer methods that handle input data errors (weights) and provide output parameter error estimates.
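
One simple method of this kind is a weighted linear least-squares fit, sketched below for a straight line. The simulated data, the true parameter values and the constant error of 0.5 are all hypothetical. Each measurement is weighted by its inverse variance, and the fit returns a parameter covariance matrix alongside the best-fit values.

```python
import numpy as np

# Simulate hypothetical measurements y(x) with known Gaussian errors sigma.
rng = np.random.default_rng(seed=2)
x = np.linspace(0.0, 10.0, 20)
sigma = np.full_like(x, 0.5)
true_intercept, true_slope = 1.0, 2.0
y = true_intercept + true_slope * x + rng.normal(scale=sigma)

# Design matrix for the model y = intercept + slope * x.
A = np.column_stack([np.ones_like(x), x])

# Inverse-variance weights built from the input measurement errors.
w = 1.0 / sigma ** 2

# Weighted least squares: cov = (A^T W A)^-1 and params = cov A^T W y.
cov = np.linalg.inv(A.T @ (w[:, None] * A))
params = cov @ A.T @ (w * y)

# Output parameter errors are the square roots of the covariance diagonal.
errors = np.sqrt(np.diag(cov))

print('intercept = {:.3f} +/- {:.3f}'.format(params[0], errors[0]))
print('slope     = {:.3f} +/- {:.3f}'.format(params[1], errors[1]))
```

The quoted errors are only meaningful if the input `sigma` values are correct and the straight-line model actually describes the data.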