# Astrostatistics and Machine Learning (F5802Q020)

**Davide Gerosa (University of Milano-Bicocca and INFN)**

*Astrostatistics*: the application of statistics to the study and analysis of astronomical or astrophysical data.

This course is based on previous work by many people. See [here](https://github.com/dgerosa/astrostatistics_bicocca_2024/blob/main/README.md) for credits.

---

# Introduction

![](https://web-assets.domo.com/blog/wp-content/uploads/2023/12/23-dns11-FINAL-1.png)

The amount of raw information that is being generated every minute is extraordinary. And astronomy is no exception! Whether for **"big data"** or **"small data"**, a proper statistical treatment that accounts for statistical and systematic noise, as well as signal dependencies on the measured output, is essential to discovery.

---

What kinds of things can we learn from data, and how do we do it? 

*What* we can learn is really dependent on your goal, but this must align with the information content of the data. *How* we can interact with data is what **Data Mining** and **Machine Learning** are all about.

* **Data mining** is exactly what it sounds like: sifting through piles of data in order to find something useful---like digging rock from the ground and extracting metal ores from it. It is sometimes called "knowledge discovery", since the emphasis is on techniques and attempts to find patterns in structured data.
* **Machine learning** is about how to do this using computers to leverage our ability to extract useful information from the data by statistically comparing data to various models. The techniques are sometimes called "statistical inference", encompassing regression and model selection. 

---

Who does data mining and uses machine learning? Just about everyone, and for just about everything. Some examples from the real world (but there are so many!):

- Amazon to predict things that you might buy or ads that you might like, https://phys.org/news/2019-06-amazon-tracking.html
- Google for everything I guess, but here is a link about self-driving cars: http://dataconomy.com/how-data-science-is-driving-the-driverless-car/
- Netflix to predict what shows you are likely to want to watch: https://en.wikipedia.org/wiki/Netflix_Prize, https://www.wired.co.uk/article/how-do-netflixs-algorithms-work-machine-learning-helps-to-predict-what-viewers-will-like
- Insurance companies to predict how much of a risk it is to insure you
- Financial institutions to predict the future prices of their investments
- Election prognosticators, e.g., http://fivethirtyeight.com/
- Sports teams e.g., https://en.wikipedia.org/wiki/Moneyball


And, of course, **physicists and astronomers to study the world around us!**


---

### What is this class?

* An introduction to *practical* statistical inference and data analysis.
* *Practical*: This is important! One does not understand how to treat scientific data by reading equations on the blackboard: you will need to get your hands dirty (and this is the fun part!).
* While examples are skewed towards astrophysics, the techniques we will look at are very general.

### Why this class?

We're first of all astrophysicists, not data scientists or statisticians. Stats is a tool, but a very important one!  **Having some knowledge of statistical inference, machine learning, and data mining is *absolutely essential* in today's modern astrophysics research.** There's really no way around it, I think.

This figure from *drewconway.com* nicely illustrates the goal here. 
- You've had many of the green classes already (calculus, advanced maths methods)
- You're into an entire degree of blue classes (astrophysics in this case)
- And perhaps like to play with the red things as well (a hacking project you want to share?)

This class is an attempt to put everything together and go as close to the middle as possible.

![http://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w](http://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w)

---

### My research interests

My own interest in the topic stems from my scientific research in **gravitational-wave astronomy** and **black-hole binary dynamics**. The current experiments are [LIGO](https://www.ligo.caltech.edu/) and [Virgo](https://www.virgo-gw.eu/), to be joined in the future by the space-borne [LISA](https://lisa.nasa.gov/) detector. Some of you are going to have a full class on gravitational-wave astronomy! These instruments extract detailed information about GW-emitting systems from **signals buried deep, deep within noise**. The techniques required to do this are at the forefront of data-based inference, relying heavily on **Bayesian statistics**.

Gravitational-wave data exploitation is in a particularly exciting phase now, sitting at the border between the small-data and the big-data regimes.

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.100.064060/figures/4/medium)

Figure from: *Gravitational-wave detection rates for compact binaries formed in isolation: LIGO/Virgo O3 and beyond*,  Baibhav, Berti, **Gerosa** et al. Physical Review D 100 (2019) 064060.

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.98.083017/figures/9/medium)

Figure from: *Mining Gravitational-wave Catalogs To Understand Binary Stellar Evolution: A New Hierarchical Bayesian Framework*,  Taylor, **Gerosa** et al. Physical Review D 98 (2018) 083017.

This fit was done with Gaussian Process Regression... We'll see what this is about

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.102.103020/figures/2/medium)

Figure from: *Gravitational-wave selection effects using neural-network classifiers*, **Gerosa**, Pratten, Vecchio. Physical Review D 102 (2020) 103020. 

This one instead was done with a multilayer perceptron. We'll see what this is about...

![](https://journals.aps.org/prd/article/10.1103/PhysRevD.104.044065/figures/9/medium)

Figure from: *Bayesian parameter estimation of stellar-mass black-hole binaries with LISA*, **Buscicchio** et al. (incl. **Gerosa**). Physical Review D 104 (2021) 044065.  

And this is nested sampling for the LISA space mission... We'll see this one as well


This is a bit of what I do, but there is so much more in astrostatistics!

There is a new generation of astronomical sky surveys like the [Legacy Survey of Space and Time (LSST)](https://www.lsst.org/) of the Vera Rubin Observatory. LSST is a project that is going to generate about **200 PB** of data (that's 200 million GB) by the end of its 10-year mission.  During that time, it will have measured a hundred or more properties for some 40 billion objects---*every 3 nights*.


Moving to future experiments, the [Square Kilometre Array](https://www.skatelescope.org/) will be the premier radio observatory of the late 2020s and 2030s, consisting of ~3000 dishes, each 15 meters wide, and spread over South Africa and Australia. In a single day, it will generate raw data amounting to an ***exabyte***, which is more than the entire current daily internet traffic.
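
To put these numbers side by side, here is a quick back-of-the-envelope check. It is only a sketch: it assumes decimal (SI) prefixes and takes the 200 PB and one-exabyte-per-day figures quoted above at face value.

```python
# Back-of-the-envelope comparison of the data volumes quoted above.
# Assumption: decimal (SI) prefixes, so 1 GB = 1e9 B, 1 PB = 1e15 B, 1 EB = 1e18 B.
GB = 10**9
PB = 10**15
EB = 10**18

lsst_total_bytes = 200 * PB   # LSST: about 200 PB over the whole survey
ska_daily_bytes = 1 * EB      # SKA: about one exabyte of raw data per day

lsst_total_gb = lsst_total_bytes // GB
ska_day_over_lsst_total = ska_daily_bytes / lsst_total_bytes

print(f"LSST total: {lsst_total_gb:,} GB")
print(f"One SKA day = {ska_day_over_lsst_total:.0f} x the full LSST data set")
```

With these inputs, a single day of raw SKA data is several times larger than everything LSST collects in a decade.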

Even that is small potatoes for particle physicists. In particle physics, they throw away *most* of their data because there is so much.  Instead they have the notion of a *trigger*, which is basically a series of "if-then" statements that decide whether or not an "event" is worth saving (for future analysis).
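
A trigger of this kind can be sketched in a few lines. This is a toy illustration only: the event fields (`energy`, `n_tracks`) and the thresholds are made up here and do not correspond to any real experiment.

```python
# Toy "trigger": a chain of if-then rules deciding whether to keep an event.
# All field names and thresholds below are hypothetical, for illustration only.

def trigger(event, min_energy=100.0, min_tracks=2):
    """Return True if the event is worth saving for later analysis."""
    if event["energy"] < min_energy:
        return False   # too little energy: discard
    if event["n_tracks"] < min_tracks:
        return False   # too few tracks: discard
    return True        # passed every cut: keep


events = [
    {"energy": 250.0, "n_tracks": 4},
    {"energy": 30.0, "n_tracks": 5},
    {"energy": 180.0, "n_tracks": 1},
]

saved = [ev for ev in events if trigger(ev)]
print(f"Kept {len(saved)} of {len(events)} events")
```

The point is that the decision is made event by event, before anything is written to disk, so whatever the rules reject is gone for good.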

---

### Your research interests

Enough about me, how about *your* interests?

- Are you all enrolled in the Astrophysics MSc degree? What year?
- Do we have people from Physics as well? If yes, what curriculum? People from other degrees?
- Why did you pick this class? (I bet not because it's going to be easy...)
- What do you want from it? (so at the end we can check if you got it). How do you think statistics and data mining can be useful in your career?

---

### Broadly speaking...


Almost everything that we will do can be categorized into one of two different pairs of things.

- **Supervised learning** vs. **unsupervised learning**
  - It's the learning algorithm that is being supervised, not you :-)
  - Unsupervised learning is associated with data mining and knowledge discovery. It is exploratory data analysis, learning qualitative features of structured/labeled data that were not previously known.  
  - Supervised learning is associated with machine learning and statistical inference. We might know the "truth" for (some of) the data that we are analyzing, which is "supervising" and guiding the learning process. We might have a physical model (or models) for the phenomenon against which we are fitting data, or we might be using the data to select between physical models.
  
  
![](https://assets.extrahop.com/images/blogart/supervised-vs-unsupervised-ml.png)
(Figure from Wu 2019)  

- **Classification** vs. **Regression**
  - Classification means that we are trying to put our data into different discrete categories, e.g. **model selection**.
  - Regression is the limit where the classification "bins" become continuous, e.g. **parameter estimation**.

![](https://www.researchgate.net/profile/Yves-Matanga/publication/326175998/figure/fig9/AS:644582983352328@1530691967314/Classification-vs-Regression.png)
(Figure from Matanga 2017)  
 

These can be combined together:
 * supervised classification
 * unsupervised classification (aka clustering)
 * supervised regression
 * unsupervised regression (aka dimensional reduction)
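
As a minimal sketch of the difference between the two supervised cases, the snippet below fits a straight line (regression: the answer is a continuous number) and assigns points to the nearest class centroid (classification: the answer is a discrete label). The data and the labels `"A"` and `"B"` are invented for illustration, and everything is written in plain Python; in practice one would reach for a library such as scikit-learn.

```python
# Supervised regression vs. supervised classification on made-up numbers.
# Pure-Python sketch; the data and the class labels are invented for illustration.

def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept (regression)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def fit_centroids(values, labels):
    """Mean feature value of each labelled class (the training step)."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: sum(v) / len(v) for label, v in groups.items()}


def classify(value, centroids):
    """Assign the class whose centroid is closest (classification)."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


# Regression: the output is a continuous parameter estimate.
slope, intercept = fit_line([0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 7.1, 8.8])
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# Classification: the output is one of a discrete set of categories.
centroids = fit_centroids([0.8, 1.0, 1.2, 2.8, 3.0, 3.2],
                          ["A", "A", "A", "B", "B", "B"])
print(classify(1.4, centroids), classify(2.6, centroids))
```

Both cases are "supervised" because the training data come with known answers: the measured `ys` for the line, the labels for the centroids.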

Graphically the course can be represented as a tour of the following [flowchart](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html):
![](http://scikit-learn.org/stable/_static/ml_map.png)

---

# Content

The course is divided into 9 sections:

1. **Probability**
2. **Frequentist inference**
3. **Bayesian inference**
4. **Density estimation and clustering**
5. **Dimensional reduction**
6. **Regression**
7. **Classification**
8. **Deep learning**
9. Time series (we won't have time for this but there's material on the website)

There's a lot to cover in data mining, so we'll necessarily only provide a broad introduction to these topics. Not saying you'll be ready for a data science job after this class, but this will definitely put you ahead of many many astronomers. In any case, I hope that's enough to give you the key background to dig deep into whatever you'll need for your research, during your Master's thesis and beyond.

## Website 

### Everything happens on github: [github.com/dgerosa/astrostatistics_bicocca_2024](https://github.com/dgerosa/astrostatistics_bicocca_2024)

There you'll find all the code, examples, resources, extra material, etc. I'm not going to use e-learning much because I want the material of this class to be publicly available (and indeed, students from elsewhere are also studying this now!)


## Setup

Data mining and machine learning are computational subjects. One does not understand how to treat scientific data by reading equations on the blackboard: you will need to get your hands dirty (and this is the fun part!). Students are required to come to classes with a laptop or any device you can code on (larger than a smartphone I would say...). Each class will pair theoretical explanations with hands-on exercises and demonstrations. These are the key content of the course, so please engage with them as much as possible.

The way the class is going to work is that I'll teach for I guess a bit more than half the time, and then we open the playground! At various points in the lectures you'll find **"Time to get your hands dirty!"**

![Untitled.jpg](attachment:Untitled.jpg)


These are practical lab sessions where you'll be asked to immediately apply the techniques you have just seen. So, open your jupyter notebook and have fun! I will, of course, be around to answer questions, give hints, guide the class, etc.

### Which data?

The vast majority of the applications presented in this class make use of astrophysical data. Some are from my research (*), most of them are not. 

- Get a goat in a TV show riddle.
- Horse-kick deaths in the Prussian army in the 1850s.
- Cloning quasars from the Sloan Digital Sky Survey.
- Positions of quasars in the sky.
- Black-hole binaries can form in different ways. (*)
- Extracting energy from a black hole. (*)
- Planning a telescope observation.
- Characterize astrophysical transients.
- Teach a computer how to read handwritten digits.
- The current catalog of gamma-ray bursts.
- Cleaning detector noise.
- Supernova distance and redshift data for cosmology.
- Is it a quasar or a galaxy? 
- Is this gravitational-wave detectable? (*)
- Light curves from RR Lyrae variable stars.
- The deep-learning playground.
- Is it a quasar or a galaxy? But with deep learning...
- Is this gravitational-wave detectable? (*) But with deep learning...

The amount of practical activities will increase during the class. I'm afraid at the beginning we will need to lay the groundwork with a bit more maths.

---

# Logistics


### Class times
The class covers 6 credits = 42 hours = 21 lectures of 2 hours each. Our weekly timeslots are **Monday 8.30am-10.30am** (sorry) and **Thursday 10.30am-12.30pm**. Note that there are a few extra lectures on different days/times, as well as a few weeks where we're going to skip classes. We're in room U2-05 on the main Bicocca campus.

 1. **04-03-24, 8.30am.**
 2. **07-03-24, 10.30am.**
 3. **11-03-24, 8.30am.**
 4. **14-03-24, 10.30am.**
 5. **18-03-24, 8.30am.** (graduation committee?)
 6. **21-03-24, 10.30am.**
 7. **25-03-24, 8.30am.**
 - 28-03-24 Holiday
 - 04-04-24, Davide is away for research
 8. **08-04-24, 8.30am.**
 9. **11-04-24, 10.30am.**
 10. **15-04-24, 8.30am.**
 - 18-04-24, Davide is away for research
 11. **22-04-24, 8.30am.**
 - 25-04-24 Holiday
 12. **29-04-24, 8.30am.**
 13. **30-04-24, 10.30am.** (Note additional day!)  
 14. **02-05-24, 10.30am.**
 15. **06-05-24, 8.30am.**
 16. **09-05-24, 10.30am.**
 17. **13-05-24, 8.30am.**
 18. **16-05-24, 10.30am.**
 19. **20-05-24, 8.30am.**
 20. **23-05-24, 10.30am.**
 21. **27-05-24, 8.30am.**
 -  30-05-24, 10.30am. Could add a lecture here if we need it.


The calendar is available at [github.com/dgerosa/astrostatistics_bicocca_2024](https://github.com/dgerosa/astrostatistics_bicocca_2024).

### Exams

- During the lectures you'll have time to take a stab at all the **"Time to get your hands dirty!"** experiments. You will have the opportunity to start playing with those datasets, but they are meant to be examples to encourage you to do more (apply a different technique to the same data, etc).
- At the exam, you will need to come with all your **"Time to get your hands dirty!"** notebooks (committed to your git fork, see below). At the exam I'll ask you to guide me through one or two of them.
- We  will then move on to other topics, including theory questions on all the material we covered. If you ever wonder whether a topic we covered is examinable, the answer is yes.
- Exams are by appointment only (the official exam dates are nominal). **While I'm happy to be flexible with the dates and value your time, that applies on your end as well.** If you're not as prepared as you would have liked and decide to turn down the result, that's ok, but **I will ask you to wait about 2 months before coming back** (which is the rough equivalent of 6 exam attempts in 1 year).

--- 

# Textbook and resources


### Main textbook:

["Statistics, Data Mining, and Machine Learning in Astronomy"](https://press.princeton.edu/books/hardcover/9780691198309/statistics-data-mining-and-machine-learning-in-astronomy), Ivezić, Connolly, VanderPlas, and Gray. Princeton University Press, 2012.

It's a wonderful book that I keep on referring to in my research. The library has a few physical copies and you can download it for free from the library website with your Bicocca credentials. What I really like about that book is that they provide the code behind every single figure: [astroml.org/book\_figures](https://www.astroml.org/book_figures/). The best way to approach these topics is to study the introduction in the book, then grab the code and try to play with it.  Make sure you get the updated edition of the book (that's the one with a black cover, not orange) because all the examples have been updated to python 3.   

![](https://pup-assets.imgix.net/onix/images/9780691198309.jpg?fit=fill&fill=solid&fill-color=ffffff&w=1200&h=630)

### Other useful resources  

- ["Statistical Data Analysis"](https://global.oup.com/academic/product/statistical-data-analysis-9780198501558?cc=fr&lang=en&), Cowan. Oxford Science Publications, 1997. 
- ["Data Analysis: A Bayesian Tutorial"](https://global.oup.com/academic/product/data-analysis-9780198568322?cc=fr&lang=en&), Sivia and Skilling. Oxford Science Publications, 2006.
- ["Bayesian Data Analysis",](http://www.stat.columbia.edu/~gelman/book/) Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin. Chapman & Hall, 2013. Free!
- ["Python Data Science Handbook",](https://jakevdp.github.io/PythonDataScienceHandbook/) VanderPlas. O'Reilly Media, 2016. Free!
- ["Practical Statistics for Astronomers"](https://www.cambridge.org/core/books/practical-statistics-for-astronomers/CEB9D5F985F062BAD67E7219B96A4CD6), Wall and Jenkins. Cambridge University Press, 2003.
- ["Bayesian Logical Data Analysis for the Physical Sciences",](https://www.cambridge.org/core/books/bayesian-logical-data-analysis-for-the-physical-sciences/09E9A95DAE275F5B005676C71B542598) Gregory. Cambridge University Press, 2005.
- ["Modern Statistical Methods For Astronomy"](https://www.cambridge.org/core/books/modern-statistical-methods-for-astronomy/941AE392A553D68DD7B02491BB66DDEC), Feigelson and Babu. Cambridge University Press, 2012.
- ["Information theory, inference, and learning algorithms"](https://www.inference.org.uk/mackay/itila/book.html) MacKay. Cambridge University Press, 2003. Free!  
- "Data analysis recipes". These are free chapters of a book by Hogg et al. that is not yet finished.
    - ["Choosing the binning for a histogram"](https://arxiv.org/abs/0807.4820) [arXiv:0807.4820]
    - ["Fitting a model to data"](https://arxiv.org/abs/1008.4686) [arXiv:1008.4686]
    - ["Probability calculus for inference"](https://arxiv.org/abs/1205.4446) [arXiv:1205.4446]
    - ["Using Markov Chain Monte Carlo"](https://arxiv.org/abs/1710.06068) [arXiv:1710.06068]
    - ["Products of multivariate Gaussians in Bayesian inferences"](https://arxiv.org/abs/2005.14199) [arXiv:2005.14199]
- ["Practical Guidance for Bayesian Inference in Astronomy"](https://arxiv.org/abs/2302.04703), Eadie et al., 2023.
- ["Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow"](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/), Geron, O'Reilly Media, 2019.
- ["Machine Learning for Physics and Astronomy"](https://press.princeton.edu/books/paperback/9780691206417/machine-learning-for-physics-and-astronomy), Acquaviva, Princeton University Press, 2023.


### Still need to embrace the python world?

We will make heavy usage of the python programming language. If you need to refresh your **python skills**, here are some catch-up resources and online tutorials. A strong python programming background is essential in modern astrophysics.

- ["Scientific Computing with Python"](https://github.com/dgerosa/scientificcomputing_bicocca_2023), D. Gerosa. This is a class I teach for the PhD School here at Milano-Bicocca.
- ["Lectures on scientific computing with Python"](https://github.com/jrjohansson/scientific-python-lectures), R. Johansson et al.  
- ["Python Programming for Scientists"](https://astrofrog.github.io/py4sci/), T. Robitaille et al.
- ["Learning Scientific Programming with Python"](https://www.cambridge.org/core/books/learning-scientific-programming-with-python/3D264483BC7B380A3059B3861C661237), Hill, Cambridge University Press, 2020. Supporting code: [scipython.com](https://scipython.com/).

---

# Credits and feedback

### I need your help!

This is the third year that Bicocca offers an Astrostatistics class. I'm not entirely sure but I think this is also the only class of this kind in the country. 

**This also means I haven't taught this that many times.** So please be kind, things are not going to be perfect... But I want to make it better!

Very important: **please do give me feedback** (what works, what doesn't work, what topics have been covered in other classes already, if I assume too much from your computing skills, or if you want me to go faster, if the exercises are too hard or too easy, etc).

I plan to circulate a form for anonymous feedback at the end of the class, but please feel free to come and give me your opinions at any time! This is particularly useful so I can adjust the class as we proceed.

I'm very much aware I'm setting a high bar: this class will be demanding. I think we always need a challenge in front of us to be excited by something new.


### Get in touch!

I'm very happy to chat about the class (and more: gravity, science, career prospects, etc). My office is number 2007 on the second floor of the U2 building. Feel free to stop by and knock on my door. Or send me an email for an appointment: [davide.gerosa@unimib.it](mailto:davide.gerosa@unimib.it).

### Recordings

The classes will be recorded but not live streamed. Recordings will be available on our [e-learning page](https://elearning.unimib.it/course/view.php?id=35298), not on github for privacy reasons.

**I think that attending lectures in person is crucial.** As you will see very soon, you will be asked to immediately apply what you have learned, while you learn it. If you're not here, you'll miss out on the vast majority of the learning experience. Binge-watching the recorded lectures before the exam is not the same thing as attending a class (this is always true in my opinion, but especially in this case). 


### A huge thanks to...

This class draws heavily from many others that came before me. Credit goes to:

- Stephen Taylor (Vanderbilt University): [github.com/VanderbiltAstronomy/astr_8070_s21](https://github.com/VanderbiltAstronomy/astr_8070_s21).
- Gordon Richards (Drexel University): [github.com/gtrichards/PHYS_440_540](https://github.com/gtrichards/PHYS_440_540).
- Jake Vanderplas (University of Washington): [github.com/jakevdp/ESAC-stats-2014](https://github.com/jakevdp/ESAC-stats-2014).
- Zeljko Ivezic (University of Washington): [github.com/uw-astr-302-w18/astr-302-w18](https://github.com/uw-astr-302-w18/astr-302-w18).
- Andy Connolly (University of Washington): [cadence.lsst.org/introAstroML/](http://cadence.lsst.org/introAstroML).
- Karen Leighly (University of Oklahoma): [seminar.ouml.org/](http://seminar.ouml.org).
- Adam Miller (Northwestern University): [github.com/LSSTC-DSFP/LSSTC-DSFP-Sessions/](https://github.com/LSSTC-DSFP/LSSTC-DSFP-Sessions).
- Jo Bovy (University of Toronto): [astro.utoronto.ca/~bovy/teaching.html](http://astro.utoronto.ca/~bovy/teaching.html).
- Thomas Wiecki (PyMC Labs): [twiecki.github.io/blog/2015/11/10/mcmc-sampling](http://twiecki.github.io/blog/2015/11/10/mcmc-sampling).
- Aurélien Géron (freelancer): [github.com/ageron/handson-ml2](https://github.com/ageron/handson-ml2).



---

# IT setup

Again, everything happens on github: [github.com/dgerosa/astrostatistics_bicocca_2024](https://github.com/dgerosa/astrostatistics_bicocca_2024)


# Run python

This whole class is developed in python (i.e., code and lecture notes coincide). There are at least three ways to run the code I prepared.

## 1. Binder environment

Right at the top of the page you see this button: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dgerosa/astrostatistics_bicocca_2024/HEAD)

That will open an interactive environment hosted on a public cloud service where all the software we are going to need is already installed (loading might take a while). Those are jupyter notebooks like those I will use during the lectures.

This is a good option if you just want to check something quickly (perhaps while following along, or revising before the exams etc). 

No worries if Binder takes a few minutes to load, that's normal (if you want to know more about the magic behind this, check out [mybinder](https://mybinder.org/) and [docker](https://www.docker.com/)).

#### Very important

Binder sessions are temporary. Every time you close and re-open your session, the environment starts back from what's available on Github. So if you make changes there while following along and solving the in-class problems, **they will be lost forever**.

I say it again, when using Binder **it is imperative that you download the document to your machine (there's a button in the toolbar) before closing the browser window.** You can also save and load in the browser's cache, but it's less reliable in my opinion.


##  2. Unimib virtual machine

Binder is the easiest way to just follow along. When more coding is involved, you can use the Bicocca virtual machines. (I assume you've used the system before, right?).

First, sign up using this link: https://libaas.unimib.it/PubLab/register/34ddf7f4010db47dfa35

Once you have signed up, go to https://libaas-lessons.si.unimib.it to access your virtual machine. Or click this button on the github repository homepage:

[![](https://custom-icon-badges.herokuapp.com/badge/launch-unimib%20virtual%20machine-orange.svg?logo=container&logoColor=white)](https://libaas-lessons.si.unimib.it)

## 3. Your own python distribution

At some point in your research you'll need to run python code on your laptop. I guess most of you have done it already for earlier classes (if so, how many?). It might take a bit of effort to set it up, but sooner is better than later. 

If you have trouble installing python on your laptop, I'm happy to help as much as I can. I have done it before on both Mac and Linux and can debug your errors. I don't have personal experience with Windows but I was told that getting the [Anaconda installer](https://www.anaconda.com/products/individual#windows) is now the easiest way. 

## Installing python packages

You probably know this already, but installing things in python is as easy as typing `pip install something` (if pip install doesn't work, something is wrong with your python installation).

All the packages you need for this class are listed at [requirements.txt](https://github.com/dgerosa/astrostatistics_bicocca_2024/blob/main/requirements.txt) on Github. If using Binder, they are all installed by default. 
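
If you are unsure whether your own installation is complete, a quick check from within python can help. The snippet below is a sketch: the package names in `wanted` are placeholders I am assuming here, so replace them with the entries of `requirements.txt`. Also keep in mind that the name you import is not always the name you `pip install`.

```python
# Check which packages can be imported in the current Python environment.
# The package names below are placeholders (an assumption): replace them with
# the entries of the class requirements.txt.
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


wanted = ["numpy", "scipy", "matplotlib"]
missing = missing_packages(wanted)

if missing:
    print("Missing:", ", ".join(missing), "(try: pip install <name>)")
else:
    print("All requested packages are importable.")
```

This only tells you that a module can be found, not that it is the right version.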


---

# Version control with git

- How many of you have heard about `git` before?
- How many of you have some experience with it?

**Disclaimer**. You can probably get through this class without too much `git`, but I very highly recommend learning it. It's something that will make your life so much easier in research! You won't regret it.

In brief, git is a strategy to handle your files, as simple as that. Most crucially, it scales extremely well with the number of people and the complexity of the workflow. 

- Imagine having hundreds or maybe thousands of developers working on the same piece of code or on the same paper. Good luck sharing files with things like Dropbox, Google Drive, or Overleaf! (Full disclaimer: using Overleaf, at least the free version, is a terrible idea, don't use it). 
- But even if it's just for yourself... Has it ever happened that your code **used to** work a while ago, then you changed something, and now you desperately want to go back? 

![](https://www.atlassian.com/dam/jcr:9f149cef-f784-43de-8207-3e7968789a1f/03.svg)

### The basics

- `git` is a version-control system (with its own transfer protocol, kind of like `http` or `ssh`) that is designed explicitly for code development.
- On top of `git`, people have built web frontends to make our life easier (much like a browser for `http`). The most popular of these frontends is [github.com](https://github.com/) (which is owned by Microsoft now).

The core element is a "repository", which is basically just a directory --or better: a directory git can talk to. A single repository can have several instances:
- A remote server hosts a copy. In this case, we'll put it on github.com (it's free!).
- Developer 1 (say me) has a copy.
- Developer 2 (say you) has another copy.
- etc.

The repository of each developer talks only to the remote server:

![](https://www.cs.swarthmore.edu/git/git-repos.svg)

The process starts by creating a remote repository and **cloning** it locally (that's something you do only once). Someone changes the code locally and **pushes** it to the remote server. Someone else **pulls** the modification and goes on from there. The system is very smart, and has very precise rules on how to handle cases where both people edit the same file. Making a modification implies **adding** and **committing** a file.

These five commands (`clone`, `pull`, `push`, `add`, `commit`) already let you do a ton of powerful stuff.  
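
To fix ideas before touching the real thing, here is a toy model of that cycle in python. It is emphatically not how git works internally: a "repository" is just a list of commit messages, the class `ToyRepo` and its methods are made up for this example, and conflicting edits are not handled at all.

```python
# Toy model of the clone / commit / push / pull cycle.
# This is NOT how git works internally: a "repository" here is just a list of
# commit messages, and all class and method names are made up for illustration.

class ToyRepo:
    def __init__(self, history=None):
        self.history = list(history or [])

    def clone(self):
        """Make an independent local copy of this repository."""
        return ToyRepo(self.history)

    def commit(self, message):
        """Record a change locally; nothing is sent to the server yet."""
        self.history.append(message)

    def push(self, remote):
        """Send local commits to the remote, if we are up to date with it."""
        if self.history[:len(remote.history)] != remote.history:
            raise RuntimeError("remote has new commits: pull first")
        remote.history = list(self.history)

    def pull(self, remote):
        """Fetch new commits (toy version: assumes no unpushed local work)."""
        self.history = list(remote.history)


server = ToyRepo(["initial commit"])
me, you = server.clone(), server.clone()   # cloning happens only once

me.commit("add lecture 1")
me.push(server)                            # the server now has my change

you.pull(server)                           # you get my change...
you.commit("solve exercise 1")
you.push(server)                           # ...and publish yours on top of it

me.pull(server)
print(server.history)
```

Note how the two local copies never talk to each other directly: every change goes through the server, which is why pulling before pushing matters.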



## Time to get your hands dirty!

The first episode of "time to get your hands dirty" is about `git` and version control.


### Part 1. Make sure you have it.

1. To get `git` for any platform see: [https://git-scm.com/download/](https://git-scm.com/download/). If you don't have git installed right now on your machine, I encourage you to sort it out later in your own time.  For now you can use a Bicocca virtual machine, where `git` is already installed and functioning. Open a terminal window and run

```bash
which git
```

You should get a path like `/usr/bin/git` that indicates git is indeed present on your machine. 

2. Now create an account on [github.com](https://github.com/). Do what they say, and obviously select the free version. 

**Very important**. Pick a professional username. Treat your github profile as an extension of your CV. People will look at it!

3. We now need to set up cryptographic keys (we'll use [RSA keys](https://en.wikipedia.org/wiki/RSA_(cryptosystem)), a very fascinating concept of safe encryption which has to do with prime number theory).

On your terminal type
```bash
ssh-keygen
```
and hit return three times. You should see the paths of the keys. Now copy the content of your public key (not the private one!)

```bash
cat [path]/id_rsa.pub
```

Go to github, top-right corner, Settings, SSH and GPG keys, New SSH key. Paste the content of `id_rsa.pub` into that box. Be careful not to add unwanted new-line characters when copying and pasting.
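
If you are not sure where your public key ended up, a small python helper can list the candidates. This is a sketch under the assumption that your keys live in the default `~/.ssh` directory; the function name `public_keys` is made up here. Depending on your OpenSSH version and options, the file may have a different name than `id_rsa.pub` (for instance `id_ed25519.pub`).

```python
# List the public SSH keys found in a directory (by default ~/.ssh).
# Only files ending in ".pub" are read; private keys are never opened.
from pathlib import Path


def public_keys(ssh_dir=None):
    """Return {file name: key text} for every *.pub file in ssh_dir."""
    ssh_dir = Path(ssh_dir) if ssh_dir is not None else Path.home() / ".ssh"
    if not ssh_dir.is_dir():
        return {}
    return {
        path.name: path.read_text().strip()
        for path in sorted(ssh_dir.glob("*.pub"))
    }


for name, key in public_keys().items():
    # Show only the beginning of each key, enough to recognise its type.
    print(name, "->", key[:40] + "...")
```

The stripped key text is what goes into the github box, which also takes care of stray new-line characters at the end.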


### Part 2. Let's go!

Let's practice some git now!

1. On [github.com](https://github.com/), create a repository called `ilovegit`.

2. clone your repo using:
```bash
cd ~/reps # Or wherever you want it to be
git clone git@github.com:YOUR_GITHUB_USERNAME/ilovegit.git
```

3. start Jupyter in the cloned directory
```bash
cd ilovegit
jupyter notebook &
```
4. create a new notebook. Name it `hello.ipynb`. Add a cell with the following piece of code:
```python
print("Hello World!")
```
5. see what happened:
```bash
git status
```
6. add the notebook to your git repository and commit by running (in the terminal window) the following:
```bash
git add hello.ipynb
git commit -m "Added hello.ipynb to repository."
```
7. see what happened:
```bash
git status
```
8. make another change in the Jupyter notebook. For example, add another cell ("+" icon on the toolbar) with the following:
```python
x = 2+2
print(x)
```
9. see what happened
```bash
git status
```
10. commit changed files (the `-a` option is equivalent to doing `add` and then `commit` for files git already tracks)
```bash
git commit -am "Updated hello.ipynb with complex mathematics."
```
11. "push" the changes to github
```bash
git push
```
12. go browse the result on github

13. edit the readme from the browser on github (this is to mimic what happens when someone else touches the code)

14. "pull" the changes from github
```bash
git pull
```
15. have a look at your local copy of README.md


### Part 3. Interact with the class material

I developed this class using `git`. Go to the class git repository at https://github.com/dgerosa/astrostatistics_bicocca_2024. **Don't clone this!** Instead, look to the top right of the page for an option to fork the repository. This will make a copy of the class repository for your own personal use.

Now that you have a fork of the repository, clone it to your machine.

```bash
git clone git@github.com:YOUR_GITHUB_USERNAME/astrostatistics_bicocca_2024.git
```


Before proceeding further, we're now going to add the `dgerosa` repository as an [`upstream` repository to your fork](https://docs.github.com/en/free-pro-team@latest/github/collaborating-with-issues-and-pull-requests/configuring-a-remote-for-a-fork). First, list the current configured remote repository for your fork with:

```bash
git remote -v
```

Now, add the `dgerosa` repo as an `upstream`:

```bash
git remote add upstream https://github.com/dgerosa/astrostatistics_bicocca_2024
```

Verify that the new repository shows as an upstream by running `git remote -v` again.


You now have the ability to work with your own fork, sync upstream changes to this fork, and commit changes to your fork. (We won't do it, but github allows you to ask for permission to incorporate changes upstream; this feature is called a `pull request`.)

In order to [sync new lectures from upstream to your fork](https://docs.github.com/en/free-pro-team@latest/github/collaborating-with-issues-and-pull-requests/syncing-a-fork), run the following in the local directory of your cloned fork:

```bash
git fetch upstream
git checkout main
git merge upstream/main
```

You should do this often in order to see new materials that I add. 
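
If you prefer to trigger the sync from python (say, from a notebook), the three commands above can be wrapped in a small helper. This is only a sketch: the function names and the `dry_run` switch are made up here, and by default nothing is executed, the commands are just printed.

```python
# Build (and optionally run) the three git commands that sync a fork.
# The function names and their arguments are made up for this example.
import subprocess


def sync_commands(remote="upstream", branch="main"):
    """Return the git commands that bring `branch` up to date with `remote`."""
    return [
        ["git", "fetch", remote],
        ["git", "checkout", branch],
        ["git", "merge", f"{remote}/{branch}"],
    ]


def sync_fork(repo_dir=".", dry_run=True):
    """Print the commands; run them in repo_dir only if dry_run is False."""
    for command in sync_commands():
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(command, cwd=repo_dir, check=True)


sync_fork()   # dry run: only shows what would be executed
```

Calling `sync_fork(repo_dir, dry_run=False)` would actually run the commands, so try the dry run first and make sure your own work is committed.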

I would like you to come to the exam with all your "Time to get your hands dirty!" exercises cleanly committed to your repository fork. 

- `lectures` contains the material shown during classes.
- `solutions` contains my exploration with the proposed datasets. Better if you don't write anything in any of these two directories. 
- `working` is an empty directory for you. Put your solutions there.

----

These commands are going to be enough to get you through this class. If you want to dig deeper into git, this is an excellent beginning-to-end crash course of about 1 hour. I found it very clear.

```python
from IPython.display import YouTubeVideo
YouTubeVideo('RGOj5yH7evk')
```

### If you really hate all of this...

Learning `git` is core content of this class (and, well, needless to say, this will be taken into account at the exam). This is because I believe `git` is a cornerstone of modern software development and I really think learning it would boost your science careers.

The shortcut is to go to https://github.com/dgerosa/astrostatistics_bicocca_2024 then "Code" and "Download ZIP". This will download a copy of the class material without any git interaction. However, you'll need to do it manually every time I update the material (and sort out differences between your changes and my changes... good luck!).