# Astrostatistics

**Davide Gerosa (University of Milano-Bicocca and INFN)**

**Astrostatistics** *(noun)*: *the application of statistics to the study and analysis of astronomical or astrophysical data.*

This course is based on previous work by many people. See [here](https://github.com/dgerosa/astrostatistics_bicocca_2022/blob/main/README.md) for credits.


# Introduction

![](https://mms.businesswire.com/media/20210929005835/en/911394/5/data-never-sleeps-9.0-1200px.jpg?download=1)

The amount of raw information that is being generated every minute is extraordinary. And astronomy is no exception! Whether for **"big data"** or **"small data"**, a proper statistical treatment that accounts for statistical and systematic noise, as well as signal dependencies on the measured output, is an essential piece of discovery.

---

What kinds of things can we learn from data, and how do we do it? 

*What* we can learn depends on the goal, but the goal must align with the information content of the data. *How* we can interact with data is what **Data Mining** and **Machine Learning** are all about.

* **Data mining** is exactly what it sounds like: sifting through piles of data in order to find something useful---like digging rock from the ground and extracting metal ores from it. It is sometimes called "knowledge discovery", since the emphasis is on techniques and attempts to find patterns in structured data.
* **Machine learning** is about how to do this using computers to leverage our ability to extract useful information from the data by statistically comparing data to various models. The techniques are sometimes called "statistical inference", encompassing regression and model selection. 

---

Who does data mining and uses machine learning? Just about everyone, for just about everything. Some examples from the real world (but there are so many!):

- Amazon to predict things that you might buy or ads that you might like, https://phys.org/news/2019-06-amazon-tracking.html
- Google for everything I guess, but here's one about self-driving cars: http://dataconomy.com/how-data-science-is-driving-the-driverless-car/
- Netflix to predict what shows you are likely to want to watch: https://en.wikipedia.org/wiki/Netflix_Prize, https://www.wired.co.uk/article/how-do-netflixs-algorithms-work-machine-learning-helps-to-predict-what-viewers-will-like
- Insurance companies to predict how much of a risk it is to insure you
- Financial institutions to predict the future prices of their investments
- Election prognosticators, e.g., http://fivethirtyeight.com/
- Sports teams e.g., https://en.wikipedia.org/wiki/Moneyball


And, of course, **physicists and astronomers to study the world around us!**


---

### What is this class?

* An introduction to *practical* statistical inference and data analysis.
* *Practical*: This is important! One does not understand how to treat scientific data by reading equations on the blackboard: you will need to get your hands dirty (and this is the fun part!).
* While skewed towards astrophysics, many of the techniques we will look at are general.

### Why this class?

We're first of all astrophysicists, not data scientists or statisticians. **Having some knowledge of statistical inference, machine learning, and data mining is *absolutely essential* in today's astrophysics research.** There's no way around it, I think.

This figure from *drewconway.com* nicely illustrates the goal here. 
- You've had many of the green classes already (calculus, advanced maths methods)
- You're into an entire degree of blue classes (astrophysics in this case)
- And perhaps you like to play with the red things as well (a hacking project you want to share?)

This class is an attempt to put everything together and go as close to the middle as possible.

![http://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w](http://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w)

---

### My research interests

My own interest in the topic stems from my scientific research in **gravitational-wave astronomy** and **black-hole binary dynamics**. The current experiments are [LIGO](https://www.ligo.caltech.edu/) and [Virgo](https://www.virgo-gw.eu/), to be joined in the future by the space-borne [LISA](https://lisa.nasa.gov/) detector. Some of you are going to have a full class on gravitational-wave astronomy! These instruments extract detailed information about GW-emitting systems from **signals buried deep, deep within noise**. The techniques required to do this are at the forefront of data-based inference, relying heavily on **Bayesian techniques**.

Gravitational-wave data exploitation is in a particularly exciting phase now, bordering between the small-data and the big-data regimes.

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.100.064060/figures/4/medium)

Figure from: *Gravitational-wave detection rates for compact binaries formed in isolation: LIGO/Virgo O3 and beyond*,  Baibhav, Berti, **Gerosa** et al. Physical Review D 100 (2019) 064060.

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.98.083017/figures/9/medium)

Figure from: *Mining Gravitational-wave Catalogs To Understand Binary Stellar Evolution: A New Hierarchical Bayesian Framework*,  Taylor, **Gerosa** et al. Physical Review D 98 (2018) 083017.

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.102.103020/figures/2/medium)

Figure from: *Gravitational-wave selection effects using neural-network classifiers*, **Gerosa**, Pratten, Vecchio. Physical Review D 102 (2020) 103020. 

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.104.044065/figures/9/medium)

Figure from: *Bayesian parameter estimation of stellar-mass black-hole binaries with LISA*, **Buscicchio** et al. (incl. **Gerosa**). Physical Review D 104 (2021) 044065.


But there is so much more in astrostatistics!

There is a new generation of astronomical sky surveys like the [Legacy Survey of Space and Time (LSST)](https://www.lsst.org/) of the Vera Rubin Observatory. LSST is a project that is going to generate about **200 PB** of data (that's 200 million GB) by the end of its 10-year mission. During that time, it will have measured a hundred or more properties for some 40 billion objects---*every 3 nights*.
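
To get a feeling for these numbers, here is a quick back-of-the-envelope check. It is only a sketch: it assumes decimal (SI) prefixes and, unrealistically, that the data are spread evenly over every calendar day of the mission, so the daily figure is an average and not an actual nightly data rate.

```python
# Decimal (SI) prefixes: 1 PB = 10**6 GB, 1 TB = 10**3 GB.
GB_PER_PB = 10**6
GB_PER_TB = 10**3

total_pb = 200       # total data volume quoted in the text
mission_years = 10   # mission duration quoted in the text

total_gb = total_pb * GB_PER_PB

# Assumption: data spread evenly over every calendar day of the mission.
days = mission_years * 365.25
avg_tb_per_day = total_gb / days / GB_PER_TB

print(f"{total_pb} PB = {total_gb / 1e6:.0f} million GB")
print(f"Average over the mission: about {avg_tb_per_day:.0f} TB per day")
```

That average works out to a few tens of TB per day, sustained for a decade.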


Moving to future experiments, the [Square Kilometre Array](https://www.skatelescope.org/) will be the premier radio observatory of the late 2020s and 2030s, consisting of ~3000 dishes, each 15 meters wide, and spread over South Africa and Australia. In a single day, it will generate raw data amounting to an ***exabyte***, which is more than the entire current daily internet traffic.

Even that is small potatoes for particle physicists. In particle physics, they throw away *most* of their data because there is so much. Instead they have the notion of a *trigger*, which is basically a series of "if-then" statements that decide whether an "event" is worth saving for future analysis.
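
To make the idea concrete, here is a toy trigger. It is a sketch for illustration only: the event fields (`energy`, `n_tracks`), the thresholds, and the simulated event distribution are all made up and do not correspond to any real experiment.

```python
import random


def trigger(event):
    """Toy trigger: a chain of if-then rules deciding whether to keep an event.

    Field names and thresholds are hypothetical, chosen for illustration.
    """
    if event["energy"] < 50.0:   # too faint to be interesting
        return False
    if event["n_tracks"] < 2:    # require at least two tracks
        return False
    return True


# Simulate a stream of fake events (made-up distributions).
random.seed(0)
events = [
    {"energy": random.expovariate(1 / 20.0), "n_tracks": random.randint(0, 5)}
    for _ in range(100_000)
]

# Only events that pass the trigger are saved; everything else is discarded.
kept = [event for event in events if trigger(event)]
print(f"Kept {len(kept)} of {len(events)} events ({len(kept) / len(events):.1%})")
```

With these made-up numbers only a small fraction of the events survives, which is exactly the point of a trigger.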

---

### Broadly speaking...


Almost everything that we will do can be categorized into one of two different pairs of things.

- **Supervised learning** vs. **unsupervised learning**
  - It's the learning algorithm that is being supervised, not you :-)
  - Unsupervised learning is associated with data mining and knowledge discovery. It is exploratory data analysis, learning qualitative features of structured but unlabeled data that were not previously known.
  - Supervised learning is associated with machine learning and statistical inference. We might know the "truth" for (some of) the data that we are analyzing, which is "supervising" and guiding the learning process. We might have a physical model (or models) for the phenomenon against which we are fitting data, or we might use the data to select between physical models.
  

- **Classification** vs. **Regression**
  - Classification means that we are trying to put our data into different discrete categories, e.g. **model selection**.
  - Regression is the limit where the classification "bins" become continuous, e.g. **parameter estimation**.

![](https://www.researchgate.net/profile/Yves-Matanga/publication/326175998/figure/fig9/AS:644582983352328@1530691967314/Classification-vs-Regression.png)
(Figure from Matanga 2017)  
 

These can be combined together:
 * supervised classification
 * unsupervised classification (aka clustering)
 * supervised regression
 * unsupervised regression (aka dimensional reduction)
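
Here is a toy illustration of three of these combinations, using only the Python standard library. It is a minimal sketch, not how we will do things in the course: the data are simulated, the "true" parameters are made up, and the methods (a least-squares line fit, a nearest-centroid classifier, and a simple 2-means clustering) are deliberately the simplest ones I could pick. Dimensional reduction is left out for brevity.

```python
import random
from statistics import mean

random.seed(42)

# --- Supervised regression: fit y = a*x + b to (x, y) pairs with known y ---
true_a, true_b = 2.0, 1.0  # hypothetical "truth" used to simulate the data
xs = [i / 10 for i in range(100)]
ys = [true_a * x + true_b + random.gauss(0, 0.5) for x in xs]
x_bar, y_bar = mean(xs), mean(ys)
a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b = y_bar - a * x_bar

# --- Two simulated populations: Gaussians centred on 0 and on 5 ---
points = [random.gauss(0, 1) for _ in range(200)] + [
    random.gauss(5, 1) for _ in range(200)
]
labels = [0] * 200 + [1] * 200


def classify(p, centres):
    """Return the index of the centre closest to the point p."""
    return min(range(len(centres)), key=lambda k: abs(p - centres[k]))


# --- Supervised classification: centroids are learned from the labels ---
centroids = [
    mean(p for p, lab in zip(points, labels) if lab == k) for k in (0, 1)
]
accuracy = sum(
    classify(p, centroids) == lab for p, lab in zip(points, labels)
) / len(points)

# --- Unsupervised classification (clustering): the labels are never used ---
centres = [min(points), max(points)]  # crude initial guess
for _ in range(20):
    groups = [[], []]
    for p in points:
        groups[classify(p, centres)].append(p)
    centres = [mean(g) for g in groups]

print(f"Regression: a = {a:.2f}, b = {b:.2f}")
print(f"Supervised classification accuracy: {accuracy:.1%}")
print(f"Cluster centres found without labels: {centres[0]:.2f}, {centres[1]:.2f}")
```

Note that the clustering step recovers roughly the same two groups as the classifier without ever seeing the labels; that is the difference between "unsupervised" and "supervised".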

Graphically the course can be represented as a tour of the following [flowchart](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html):
![](http://scikit-learn.org/stable/_static/ml_map.png)

---

# Content

The course is divided into 9 sections:

1. Probability
2. Frequentist inference
3. Bayesian inference
4. Density estimation and clustering
5. Dimensional reduction
6. Regression
7. Classification
8. Time series analysis
9. Deep learning

As you see, there's a lot to cover in data mining, so we'll necessarily only provide a broad introduction to these topics. But I hope that's enough to give you the key background to dig deep into whatever you need for your research, during your Master's thesis and beyond.



### A huge thanks to...

This class draws heavily from many others that came before me. Credit goes to:

- Stephen Taylor (Vanderbilt University): [github.com/VanderbiltAstronomy/astr_8070_s21](https://github.com/VanderbiltAstronomy/astr_8070_s21).
- Gordon Richards (Drexel University): [github.com/gtrichards/PHYS_440_540](https://github.com/gtrichards/PHYS_440_540).
- Jake Vanderplas (University of Washington): [github.com/jakevdp/ESAC-stats-2014](https://github.com/jakevdp/ESAC-stats-2014).
- Zeljko Ivezic (University of Washington): [github.com/uw-astr-302-w18/astr-302-w18](https://github.com/uw-astr-302-w18/astr-302-w18).
- Andy Connolly (University of Washington): [cadence.lsst.org/introAstroML/](http://cadence.lsst.org/introAstroML).
- Karen Leighly (University of Oklahoma): [seminar.ouml.org/](http://seminar.ouml.org).
- Adam Miller (Northwestern University): [github.com/LSSTC-DSFP/LSSTC-DSFP-Sessions/](https://github.com/LSSTC-DSFP/LSSTC-DSFP-Sessions).
- Jo Bovy (University of Toronto): [astro.utoronto.ca/~bovy/teaching.html](http://astro.utoronto.ca/~bovy/teaching.html).
- Thomas Wiecki (PyMC Labs): [twiecki.github.io/blog/2015/11/10/mcmc-sampling](http://twiecki.github.io/blog/2015/11/10/mcmc-sampling).
- Aurélien Géron (freelancer): [github.com/ageron/handson-ml2](https://github.com/ageron/handson-ml2).


### I need your help!

This is the first year that Bicocca offers an Astrostatistics class. I'm not entirely sure, but I think it's the only class of this kind in the country.

It is also the first time I'm teaching it. So please be kind, things are not going to be perfect...

Very important: **please do give me feedback** (what works, what doesn't work, what topics have been covered in other classes already, whether I assume too much in terms of computing skills, whether you want me to go faster, whether the exercises are too hard or too easy, etc).

I plan to circulate a form for anonymous feedback later on, but please feel free to come and give me your opinions at any time!


### Get in touch!

I'm very happy to chat about the class (and more: gravity, science, career prospects, etc). My office is number 2007 on the second floor of the U2 building. Feel free to stop by and knock on my door (I might say I'm busy and ask you to come back later...). Or send me an email for an appointment: [davide.gerosa@unimib.it](mailto:davide.gerosa@unimib.it).

---

# Logistics

Let's sort out some boring logistics details...

### Class times

- Classes are on **Monday and Wednesday at 8.30-10.30** (sorry! I didn't choose!) in **room U2-05**.
- For the next two weeks we're also going to have lectures on Tuesday at 8.30-10.30. That is: two extra lectures on March 8th and March 15th. 
- This is to optimize the schedule because (i) there are empty slots in your timetable and (ii) we will skip some lectures later in April because I'm going to be away for research commitments. A few extra lectures now will allow us to still finish the class by the end of May, leaving you more time before the exam season starts.
- Also, let's get the simple probability stuff out of the way and move on to the cool machine learning part! :-)

---

# IT setup

### Everything happens on github: [github.com/dgerosa/astrostatistics_bicocca_2022](https://github.com/dgerosa/astrostatistics_bicocca_2022)

There you'll find all the code, examples, resources, extra material, etc.

You can follow along in three ways:


### Binder environment

Right at the top you see this button: ![](https://mybinder.org/badge_logo.svg)

That will open an interactive environment where all the software we are going to need is already installed. It's a Jupyter notebook like those I will use during the lectures. (If you want to know more about the magic behind it, check out [mybinder](https://mybinder.org/) and [docker](https://www.docker.com/).)

#### Very important

Binder sessions are temporary. Every time you close and re-open your session, the environment starts again from what's available on GitHub. So if you make changes there while following along and solving the in-class problems, **these won't be saved**.


#### It is imperative that you download the document to your machine (there's a button in the toolbar) before closing the browser window.

You can also save and load in the browser's cache, but it's less reliable in my opinion.

### Unimib virtual machine

My binder is the easiest way to just follow along. When more coding is involved, you can use the Bicocca virtual machine. 

FINISH ME.

### Your own python distribution

At some point in your research you're going to run Python code on your own computer (maybe you have already done it for your BSc thesis? How many?). This is not necessary for my class, but you might want to have a go at installing a Python distribution.

If you want to install Python on your laptop, I'm happy to help as much as I can (I don't guarantee success, but I have done it before on both Mac and Linux; sorry, not on Windows).
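
Once you have something installed, a quick way to check that it works is to try importing a few packages and printing their versions. The snippet below is only a sketch, and the package list is my guess at a typical scientific Python setup, not an official list of requirements for this class; edit `PACKAGES` as needed.

```python
import importlib
import sys

# Hypothetical package list: adjust it to whatever you actually need.
PACKAGES = ["numpy", "scipy", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each name to its version, or None if missing."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module defines __version__, so fall back gracefully.
            found[name] = getattr(module, "__version__", "unknown")
    return found


print("Python", sys.version.split()[0])
for name, version in check_packages(PACKAGES).items():
    print(f"{name:12s} {version or 'NOT INSTALLED'}")
```

If anything shows up as `NOT INSTALLED`, that's the package to add to your distribution.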

# Before we start...


![](https://imgs.xkcd.com/comics/data_trap.png)

Credit: [xkcd 2582](https://xkcd.com/2582/)