# Assignment 2

## Machine Learning Techniques

This assignment is split into 3 sections, roughly corresponding to the contents of each of the 3 weeks in the Machine Learning module. 

All assignments are presented as Jupyter notebooks. You will fork the repository to have your own access to all files. You can edit this notebook directly with your answers and push your changes to GitHub. 

The goal of this assignment is to use different ML techniques to explore your data, find patterns in it, and eventually build a model that will allow us to predict stellar mass & redshift of galaxies *without doing SED fitting*.

# Section 1: Data Compression

#### Question 1

What is a dimensionality reduction technique? Why would you use one?

#### Question 2

There are many different data compression techniques: PCA, UMAP, t-SNE, VAE... Pick two of these methods, and explain briefly:
* How does each of them work?
* What are the advantages and disadvantages of each method?
* When would you use one over the other?

#### Question 3

The code below loads in the input data catalog.


In [2]:
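from astropy.table import Table
from astropy.io import fits

# Read the first table extension and convert it to a pandas DataFrame.
# The `with` block closes the file on exit, so no explicit close() is needed.
with fits.open('../data/sw_input.fits') as f:
    df = Table(f[1].data).to_pandas()
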
# Display the top 3 rows of the dataframe
df.head(3)

| | id | ra | dec | redshift | PLATE | MJD | FIBERID | designation | flux0_u | flux0_u_e | ... | flux_w2_e | flux_w3 | flux_w3_e | flux_w4 | flux_w4_e | extin_u | extin_g | extin_r | extin_i | extin_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 337.45031 | 1.266134 | 0.088372 | 376 | 52143 | 404 | J222948.07+011558.1 | 3.1e-05 | 3e-06 | ... | 4.9e-05 | 4.172e-07 | 0.000209 | 2e-06 | 0.001187 | 0.341327 | 0.26596 | 0.18399 | 0.136724 | 0.101698 |
| 1 | 5 | 338.115522 | 1.270146 | 0.1638 | 376 | 52143 | 567 | J223227.69+011612.6 | 1.1e-05 | 4e-06 | ... | 0.000111 | 9.851e-07 | 0.000493 | 4e-06 | 0.001883 | 0.368063 | 0.286793 | 0.198402 | 0.147434 | 0.109664 |
| 2 | 8 | 341.101481 | 1.266255 | 0.143369 | 378 | 52146 | 404 | J224424.38+011558.3 | 1.7e-05 | 3e-06 | ... | 3.9e-05 | 1.0137e-06 | 0.000507 | 8e-06 | 0.003856 | 0.33763 | 0.263079 | 0.181997 | 0.135243 | 0.100596 |


Then do the following:

1. Select the *meaningful* columns from the dataframe (i.e., those you think have predictive value).
2. Choose a reasonably sized subset of your data (roughly $10^3$ to $10^4$ galaxies)
    > Make sure to save your subset, or at least the IDs you chose, for later - you will need them!
3. Choose a dimensionality reduction technique
    > Most already have easy-to-use implementations so you don't have to code them from scratch: [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html), [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Training something like a Variational Autoencoder is a more involved task and usually benefits from access to a GPU.
4. Reduce your data to 2 dimensions using your chosen algorithm
5. Save your output
    > Remember to keep the IDs with the principal components, so that you can easily see which galaxy those values are for later
6. Plot the two principal variables against each other and describe what you see
    * Are there any obvious patterns in your data?
    * Are there any clusters?
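
As a starting point, the steps above could look roughly like the sketch below. It assumes PCA as the technique and uses a synthetic stand-in for the catalog so that it runs on its own. The selected columns, the subset size, the standardization step, and the output file name are all illustrative assumptions, not the expected answer.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the catalog so the sketch runs on its own;
# in the notebook, `df` comes from the loading cell above.
rng = np.random.default_rng(42)
n_galaxies = 20_000
df = pd.DataFrame({
    "id": np.arange(n_galaxies),
    "flux0_u": rng.lognormal(-11, 1, n_galaxies),
    "flux_w3": rng.lognormal(-14, 1, n_galaxies),
    "flux_w4": rng.lognormal(-13, 1, n_galaxies),
    "extin_u": rng.uniform(0.1, 0.5, n_galaxies),
})

# Keep only the columns you believe carry predictive information
# (this particular choice is illustrative, not a recommendation).
feature_cols = ["flux0_u", "flux_w3", "flux_w4", "extin_u"]

# Draw a reproducible random subset and remember its IDs.
subset = df.sample(n=5_000, random_state=0)
subset_ids = subset["id"].to_numpy()

# Standardize the features (an assumption of this sketch), then
# project them onto 2 principal components.
scaled = StandardScaler().fit_transform(subset[feature_cols])
components = PCA(n_components=2).fit_transform(scaled)

# Keep the IDs next to the components so every row stays traceable.
compressed = pd.DataFrame({
    "id": subset_ids,
    "pc1": components[:, 0],
    "pc2": components[:, 1],
})
# compressed.to_csv("compressed_subset.csv", index=False)  # hypothetical file name
```

Fixing `random_state` makes the subset reproducible, which matters because the same galaxies are needed again in the later sections.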
  
If you are feeling brave, you can try several encoding tools - how do your results change? What if you keep more than 2 principal variables? Does changing *hyperparameters* of your algorithm change your results quantitatively / qualitatively?
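
If you chose PCA, one way to judge how much you gain from keeping more than 2 components is the cumulative explained variance. The sketch below uses synthetic features (6 columns driven by 2 hidden factors, an assumption made purely for illustration) in place of the real catalog.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features: 6 columns driven by 2 hidden factors plus a little noise.
hidden = rng.normal(size=(2_000, 2))
mixing = rng.normal(size=(2, 6))
features = hidden @ mixing + 0.1 * rng.normal(size=(2_000, 6))

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=5).fit(scaled)

# Fraction of the total variance captured by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, frac in enumerate(cumulative, start=1):
    print(f"{k} components: {frac:.3f} of the variance")
```

This diagnostic is specific to PCA. UMAP and t-SNE have no direct equivalent, so for those methods you would compare the embeddings visually while varying their hyperparameters.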


#### Question 4

Now, load in the `sw_output.fits` table and cross-match the two tables to get stellar masses, redshifts, dust opacities, etc.

7. Color the points on your plot above by a physical property and discuss if you see any patterns.
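
A minimal sketch of the cross-match with pandas is shown below. It assumes both tables share the `id` column; the `stellar_mass` column name is hypothetical, so check the real `sw_output.fits` table for the actual names. Small hand-made tables stand in for the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Stand-in for the compressed subset saved earlier (IDs + 2 components).
compressed = pd.DataFrame({
    "id": [3, 5, 8, 13],
    "pc1": rng.normal(size=4),
    "pc2": rng.normal(size=4),
})

# Stand-in for the table read from `sw_output.fits`. The `stellar_mass`
# column name is hypothetical; the real table may use a different one.
output = pd.DataFrame({
    "id": [1, 3, 5, 8, 21],
    "stellar_mass": [10.1, 10.7, 9.8, 11.2, 10.4],
})

# An inner join keeps only galaxies present in both tables; `validate`
# raises an error if an ID is duplicated on either side.
merged = compressed.merge(output, on="id", how="inner", validate="one_to_one")
unmatched = set(compressed["id"]) - set(merged["id"])
print(f"{len(merged)} matched, {len(unmatched)} unmatched")
```

To color the plot, you could then pass `merged["pc1"]` and `merged["pc2"]` to `matplotlib.pyplot.scatter` with `c=merged["stellar_mass"]` and add a colorbar.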

# Section 2: Unsupervised ML

In this section, you will implement a clustering algorithm to see if there are any *natural* clusters in your data. You can choose any algorithm from the ones shown [on the Scikit-Learn website](https://scikit-learn.org/stable/modules/clustering.html). The best algorithm depends on your data, so refer back to the plots you made in Section 1 to see which algorithm you think will work best.

Load in the subset you chose in the previous section. 

In [3]:
# Space for code

#### Question 1

Choose a clustering algorithm. Why did you go for this particular one?

#### Question 2

Run clustering on your subsample (again, using only the columns you think are relevant or carry important information). Consider the questions below where they apply to your algorithm; its *hyperparameters* will often require you to answer them.

* How many clusters should you fit to your data?
* Where should the initial guesses for the cluster centers be?
* What should be the typical size for each cluster?
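
As one possible way to approach the first two questions, the sketch below assumes k-means and picks the number of clusters by comparing silhouette scores. The data are synthetic blobs standing in for your features, and the range of `k` values is an arbitrary choice. Other algorithms expose different hyperparameters (for example, DBSCAN uses `eps` and `min_samples` instead of a cluster count).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 equally separated groups, standing in for your features.
centers = [[6, 0, 0], [0, 6, 0], [0, 0, 6]]
features, _ = make_blobs(n_samples=1_500, centers=centers,
                         cluster_std=1.0, random_state=0)
scaled = StandardScaler().fit_transform(features)

scores = {}
for k in range(2, 7):
    # `n_init=10` restarts from 10 sets of initial centers (k-means++ by
    # default) and keeps the fit with the lowest within-cluster scatter.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    scores[k] = silhouette_score(scaled, model.labels_)

# The silhouette score is higher when clusters are compact and well separated.
best_k = max(scores, key=scores.get)
print(f"best number of clusters by silhouette score: {best_k}")
```

The silhouette score is only one heuristic; on real data the maximum is often less clear-cut than on blobs like these.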

#### Question 3

Look at the average *physical properties* (mass, redshift, dust...) from the output catalog for each of your clusters. Do the clusters differ in a statistically significant way in any of these properties?
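
One way to set this up is sketched below, assuming the cluster labels have already been attached to the cross-matched catalog. The column names `stellar_mass` and `dust` are hypothetical, the data are synthetic, and the Kruskal-Wallis test is just one reasonable choice of test.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(2)
n_per_cluster = 300

# Synthetic stand-in: cluster labels plus two physical properties.
# `stellar_mass` differs between clusters by construction; `dust` does not.
catalog = pd.DataFrame({
    "cluster": np.repeat([0, 1, 2], n_per_cluster),
    "stellar_mass": np.concatenate([
        rng.normal(loc, 0.3, n_per_cluster) for loc in (9.5, 10.5, 11.0)
    ]),
    "dust": rng.normal(1.0, 0.2, 3 * n_per_cluster),
})

# Mean, scatter, and size of each cluster for every property.
summary = catalog.groupby("cluster").agg(["mean", "std", "count"])
print(summary)

# Kruskal-Wallis test: the null hypothesis is that all clusters
# share the same median for the given property.
p_values = {}
for prop in ["stellar_mass", "dust"]:
    groups = [grp[prop].to_numpy() for _, grp in catalog.groupby("cluster")]
    p_values[prop] = kruskal(*groups).pvalue
print(p_values)
```

With thousands of galaxies, even tiny differences can yield small p-values, so it is worth comparing the size of the difference between cluster means to the scatter within each cluster as well.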

#### Question 4

* Repeat **Questions 1-3**, but using your compressed (2-dimensional) dataset instead of the full one. Do you see any differences in your results?
* Plot your principal components against one another, this time coloring points by the cluster they belong to. Visually, do you think your clustering worked well?
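
To put a number on how much the two clusterings differ, one option is the adjusted Rand index, which compares two label assignments regardless of how the clusters are numbered. The sketch below assumes k-means and PCA and runs on synthetic data; swap in your own features and algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic 5-dimensional data: 3 groups separated along the first
# 3 axes, with the last 2 axes containing only noise.
centers = 6.0 * np.eye(5)[:3]
features, _ = make_blobs(n_samples=1_200, centers=centers,
                         cluster_std=1.0, random_state=1)
scaled = StandardScaler().fit_transform(features)
compressed = PCA(n_components=2).fit_transform(scaled)

# Cluster once in the full space and once in the compressed space.
labels_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
labels_2d = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(compressed)

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(labels_full, labels_2d)
print(f"adjusted Rand index between the two clusterings: {ari:.3f}")
```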


<font color='red'>Optional? Maybe? Undecided:</font> plot the decision boundary (see [examples here](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html)) for your clustering algorithm. This is easier after dimensionality reduction.
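
The idea behind such a plot can be sketched by hand: evaluate the fitted model on a regular grid over the 2-dimensional space and see where the predicted label changes. The example below assumes k-means (any estimator with a `predict` method would work the same way) and uses synthetic points in place of your two components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D points standing in for the two compressed components.
points, _ = make_blobs(n_samples=600, centers=[[-4, 0], [4, 0], [0, 5]],
                       cluster_std=1.0, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Regular grid covering the data, with a margin of 1 on each side.
x_min, y_min = points.min(axis=0) - 1.0
x_max, y_max = points.max(axis=0) + 1.0
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# Predicted cluster for every grid node; the decision boundaries
# lie wherever the label changes between neighboring nodes.
grid_nodes = np.column_stack([xx.ravel(), yy.ravel()])
grid_labels = model.predict(grid_nodes).reshape(xx.shape)
```

Passing `xx`, `yy`, and `grid_labels` to `matplotlib.pyplot.contourf` and overplotting the points gives the decision-region figure; `DecisionBoundaryDisplay.from_estimator` wraps the same procedure.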

Note to self: remind them to normalize the data.

# Section 3: Supervised Learning

Use a supervised learning tool of your choice (decision trees, regression, ...) to predict stellar mass and redshift from photometry for your sample.

* Evaluate the accuracy on your sample.
* Compare to a different sample from the data (a test set): what is the accuracy there?
* Compare to my cherry-picked sample (to do): what is the accuracy there?
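
A minimal sketch of the train/test comparison is given below. It assumes a random forest regressor, a 75/25 split, and RMSE and $R^2$ as accuracy measures; all of these are illustrative choices. The photometry and target are synthetic, so the numbers it prints say nothing about the real catalog.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_galaxies = 3_000

# Synthetic photometry (5 features) and a target that depends on
# the first two of them, plus a small amount of noise.
photometry = rng.normal(size=(n_galaxies, 5))
target = (10.0 + 0.8 * photometry[:, 0] - 0.5 * photometry[:, 1] ** 2
          + 0.1 * rng.normal(size=n_galaxies))

# Hold out 25% of the galaxies; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    photometry, target, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Compare the accuracy on the training sample and on the held-out sample.
r2 = {}
for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X)
    rmse = np.sqrt(mean_squared_error(y, pred))
    r2[name] = r2_score(y, pred)
    print(f"{name}: RMSE = {rmse:.3f}, R^2 = {r2[name]:.3f}")
```

A large gap between the training and test scores is a sign of overfitting; the test score is the one that tells you how the model is likely to behave on new galaxies.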

In [None]:
#### Supervised M##