# Astrostatistics (2022)

**Davide Gerosa (University of Milano-Bicocca and INFN)**

* **Astrostatistics** *(noun)*: the application of statistics to the study and analysis of astronomical or astrophysical data.


# Sec 1. Introduction

![](https://mms.businesswire.com/media/20210929005835/en/911394/5/data-never-sleeps-9.0-1200px.jpg?download=1)

The amount of raw information generated every minute is extraordinary, and astronomy is no exception! Whether for **"big data"** or **"small data"**, a proper statistical treatment that accounts for statistical and systematic noise, as well as signal dependencies on the measured output, is an essential ingredient of discovery.

---

What kinds of things can we learn from data, and how do we do it? 

*What* we can learn depends on our goal, but the goal must be matched to the information content of the data. *How* we can interact with data is what **Data Mining** and **Machine Learning** are all about.

* **Data mining** is exactly what it sounds like: sifting through piles of data in order to find something useful---like digging rock from the ground and extracting metal ores from it. It is sometimes called "knowledge discovery", since the emphasis is on techniques and attempts to find patterns in structured data.
* **Machine learning** is about how to do this using computers to leverage our ability to extract useful information from the data by statistically comparing data to various models. The techniques are sometimes called "statistical inference", encompassing regression and model selection. 

---

Who does data mining and uses machine learning? Just about everyone, for just about everything. Some examples from the real world (but there are so many!):

- Amazon to predict things that you might buy or ads that you might like, https://phys.org/news/2019-06-amazon-tracking.html
- Google for everything I guess, but here's one about self-driving cars: http://dataconomy.com/how-data-science-is-driving-the-driverless-car/
- Netflix to predict what shows you are likely to want to watch: https://en.wikipedia.org/wiki/Netflix_Prize, https://www.wired.co.uk/article/how-do-netflixs-algorithms-work-machine-learning-helps-to-predict-what-viewers-will-like
- Insurance companies to predict how much of a risk it is to insure you
- Financial institutions to predict the future prices of their investments
- Election prognosticators, e.g., http://fivethirtyeight.com/
- Sports teams e.g., https://en.wikipedia.org/wiki/Moneyball


And, of course, **physicists and astronomers to study the world around us!**


---

### What is this class?

* An introduction to *practical* statistical inference and data analysis.
* *Practical*: This is important! One does not understand how to treat scientific data by reading equations on the blackboard: you will need to get your hands dirty (and this is the fun part!).
* While skewed towards astrophysics, many of the techniques we will look at are general.

### Why this class?

We're first of all astrophysicists, not data scientists or statisticians. **Having some knowledge of statistical inference, machine learning, and data mining is *absolutely essential* in today's astrophysics research.** There's no way around it, I think.

This figure from *drewconway.com* nicely illustrates the goal here. 
- You've had many of the green classes already (calculus, advanced maths methods)
- You're into an entire degree of blue classes (astrophysics in this case)
- And perhaps you like to play with the red things as well (a hacking project you want to share?)

This class is an attempt to put everything together and go as close to the middle as possible.

![http://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w](http://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w)

---

### My research interests

My own interest in the topic stems from my scientific research in **gravitational-wave astronomy** and **black-hole binary dynamics**. The current experiments are [LIGO](https://www.ligo.caltech.edu/) and [Virgo](https://www.virgo-gw.eu/), to be joined in the future by the space-borne [LISA](https://lisa.nasa.gov/) detector. Some of you are going to have a full class on gravitational-wave astronomy! These instruments extract detailed information about GW-emitting systems from **signals buried deep, deep within noise**. The techniques required to do this are at the forefront of data-driven inference, relying heavily on **Bayesian techniques**.

Gravitational-wave data exploitation is in a particularly exciting phase now, sitting on the border between the small-data and big-data regimes.

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.100.064060/figures/4/medium)

Figure from: *Gravitational-wave detection rates for compact binaries formed in isolation: LIGO/Virgo O3 and beyond*,  Baibhav, Berti, **Gerosa** et al. Physical Review D 100 (2019) 064060.

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.98.083017/figures/9/medium)

Figure from: *Mining Gravitational-wave Catalogs To Understand Binary Stellar Evolution: A New Hierarchical Bayesian Framework*,  Taylor, **Gerosa** et al. Physical Review D 98 (2018) 083017.

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.102.103020/figures/2/medium)

Figure from: *Gravitational-wave selection effects using neural-network classifiers*, **Gerosa**, Pratten, Vecchio. Physical Review D 102 (2020) 103020. 

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.104.044065/figures/9/medium)

Figure from: *Bayesian parameter estimation of stellar-mass black-hole binaries with LISA*, **Buscicchio** et al. (incl. **Gerosa**). Physical Review D 104 (2021) 044065.


But there is so much more room for astrostatistics!

There is a new generation of astronomical sky surveys, like the [Legacy Survey of Space and Time (LSST)](https://www.lsst.org/) of the Vera Rubin Observatory. LSST will generate about **200 PB** of data (that's 200 million GB) by the end of its 10-year survey. Over that time, it will measure a hundred or more properties for some 40 billion objects, returning to the same patch of sky roughly *every 3 nights*.


Moving to future experiments, the [Square Kilometre Array](https://www.skatelescope.org/) will be the premier radio observatory of the late 2020s and 2030s, consisting of ~3000 dishes, each 15 meters wide, spread over South Africa and Australia. In a single day, it will generate raw data amounting to an ***exabyte***, which is more than the entire current daily internet traffic.

Even that is small potatoes for particle physicists, who throw away *most* of their data because there is so much of it. Instead they have the notion of a *trigger*, which is basically a series of "if-then" statements that decide whether or not an "event" is worth saving for future analysis.
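As a toy illustration of the trigger idea, the sketch below chains a couple of "if-then" cuts over simulated events. The event fields (`energy`, `n_tracks`), the thresholds, and the function names are all hypothetical and chosen only for illustration; real triggers run in dedicated hardware and software and are far more elaborate.

```python
import random

# Hypothetical thresholds, chosen only for illustration.
ENERGY_MIN = 50.0  # arbitrary units
TRACKS_MIN = 2


def passes_trigger(event):
    """Return True only if the event survives every 'if-then' cut."""
    if event["energy"] < ENERGY_MIN:
        return False  # not energetic enough: discard
    if event["n_tracks"] < TRACKS_MIN:
        return False  # too few tracks: discard
    return True  # worth saving for future analysis


def simulate_events(n, seed=0):
    """Generate n toy events with random energies and track counts."""
    rng = random.Random(seed)
    return [
        {"energy": rng.expovariate(1 / 20.0), "n_tracks": rng.randint(0, 5)}
        for _ in range(n)
    ]


events = simulate_events(10_000)
saved = [ev for ev in events if passes_trigger(ev)]
print(f"kept {len(saved)} of {len(events)} events "
      f"({100 * len(saved) / len(events):.1f}%)")
```

With these made-up numbers only a small fraction of the events is kept, which is the whole point: the decision is cheap, it is made once, and whatever fails it is gone for good.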

---

### Broadly speaking...


Almost everything that we will do can be placed along two axes, each one a pair of contrasting approaches.

- **Supervised learning** vs. **unsupervised learning**
  - It's the learning algorithm that is being supervised, not you :-)
  - Unsupervised learning is associated with data mining and knowledge discovery. It is exploratory data analysis, learning qualitative features of structured but unlabeled data that were not previously known.
  - Supervised learning is associated with machine learning and statistical inference. We might know the "truth" for (some of) the data that we are analyzing, which is "supervising" and guiding the learning process. We might have a physical model (or models) for the phenomenon against which we are fitting data, or we might use the data to select between physical models.
  

  
- **Classification** vs. **Regression**
  - Classification means that we are trying to put our data into different discrete categories, e.g. **model selection**.
  - Regression is the limit where the classification "bins" become continuous, e.g. **parameter estimation**.
  
These can be combined together:
 * supervised classification
 * unsupervised classification (aka clustering)
 * supervised regression
 * unsupervised regression (aka dimensionality reduction)
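To make the four combinations concrete, here is a minimal sketch on toy data, using NumPy only. The datasets, the nearest-centroid classifier, the bare-bones k-means loop, and the SVD-based principal-component step are illustrative choices of mine, not the specific methods covered later in the course.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: two well-separated 1D populations with known labels (0 or 1).
x = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(3.0, 0.5, 100)])
labels = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# Supervised classification: the labels are used to learn one centroid per class.
centroids = np.array([x[labels == k].mean() for k in (0, 1)])
predicted = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
accuracy = np.mean(predicted == labels)

# Unsupervised classification (clustering): same data, labels never used.
# A few iterations of a bare-bones 1D k-means with k=2.
centers = np.array([x.min(), x.max()])
for _ in range(10):
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    centers = np.array([x[assign == k].mean() for k in (0, 1)])

# Supervised regression: fit a straight line to noisy (t, y) pairs.
t = np.linspace(0.0, 10.0, 50)
y = 2.0 * t + 1.0 + rng.normal(0.0, 0.5, t.size)
slope, intercept = np.polyfit(t, y, deg=1)

# Unsupervised regression (dimensionality reduction): principal components
# via SVD of the centered 2D points; one direction carries most of the variance.
pts = np.column_stack([t, y])
pts_centered = pts - pts.mean(axis=0)
_, sing_vals, _ = np.linalg.svd(pts_centered, full_matrices=False)
explained = sing_vals**2 / np.sum(sing_vals**2)

print(f"supervised classification accuracy: {accuracy:.2f}")
print(f"k-means cluster centers: {np.sort(centers)}")
print(f"fitted line: y = {slope:.2f} t + {intercept:.2f}")
print(f"variance fraction along first component: {explained[0]:.3f}")
```

Note that the supervised steps need the "truth" (the labels, or the measured `y` at each `t`), while the unsupervised steps only ever see the data points themselves.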

Graphically the course can be represented as a tour of the following [flowchart](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html):
![](http://scikit-learn.org/stable/_static/ml_map.png)

# IT setup