<table width="100%">
    <td align="left">
        <a target="_blank" href="https://www.up.pt/fcup/en/">
            <img src="https://divulgacao.iastro.pt/wp-content/uploads/2023/03/FCUP_logo-print_blcktransp_600ppi.png" width="90px" height="90px" style="padding-bottom:5px;"/>
        </a>
    </td>
    <td>
        <a target="_blank" href="https://www.iastro.pt/">
            <img src="https://divulgacao.iastro.pt/wp-content/uploads/2018/03/IA_logo_bitmap-rgbblack-1200px-388x259.png" width="90px" height="90px" style="padding-bottom:5px;"/>
        </a>
    </td>
    <td align="center">
        <a target="_blank" href="https://colab.research.google.com/github/jbrinchmann/MLD2025/blob/main/Notebooks/MLD2025-09-PCA%20of%20Pickles.ipynb">
           <img src="https://tinyurl.com/3mm2cyk6"  width="90px" height="90px" style="padding-bottom:5px;"/>Run in Google Colab
        </a>
    </td>
<td align="center"><a target="_blank" href="https://github.com/jbrinchmann/MLD2025/blob/main/Notebooks/MLD2025-09-PCA%20of%20Pickles.ipynb">
<img src="https://tinyurl.com/25h5fw53"  width="90px" height="60px" style="padding-bottom:0px;"  />View Source on GitHub</a></td>
</table>

# Do a PCA decomposition of the Pickles library

This notebook shows how to do a PCA decomposition of a set of spectra and how to reconstruct spectra from a smaller set of PCA components.

This can be done in a variety of ways, depending on which data we focus on.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from astropy.io import fits
from astropy.table import Table
import pandas as pd
import seaborn as sns

%matplotlib inline

## Loading of the library of spectra

The spectra are stored in the FITS file `pickles-spectra.fits` in the Pickles subdirectory. It contains the wavelength axis in the first HDU, the flux in the second and the flux uncertainty in the last. We also need the overview table, which holds the classification of each spectrum.

In [None]:
# For Colab:
#!wget --quiet -O pickles-spectra.fits https://github.com/jbrinchmann/MLD2025/raw/refs/heads/main/Datafiles/pickles-spectra.fits
#!wget --quiet -O overview-of-spectra.vot https://github.com/jbrinchmann/MLD2025/raw/refs/heads/main/Datafiles/overview-of-spectra.vot


In [None]:
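def load_pickles_library():
    """Return wavelength, flux and flux uncertainty from the Pickles FITS file."""
    # Copy the arrays so they remain valid after the file is closed.
    with fits.open('pickles-spectra.fits') as hdul:
        wave = hdul[0].data.copy()
        flux = hdul[1].data.copy()
        dflux = hdul[2].data.copy()

    return wave, flux, dflux

def load_overview_table():
    """Read the overview table with the classification of each spectrum."""
    return Table.read('overview-of-spectra.vot')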

In [None]:
wave, flux, dflux = load_pickles_library()
t_overview = load_overview_table()

In [None]:
flux.shape

## Examining a few spectra

Let us plot the first spectrum in each class.

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(12,7))

MKclasses = ['o', 'b', 'a', 'f', 'g', 'k', 'm']
for MK in MKclasses:
    ii, = np.where(t_overview['SPType'] == MK)
    
    ax.plot(np.log10(wave), flux[:, ii[0]], label='Class={0} [{1} spectra]'.format(MK, len(ii)))
    
ax.set_xlabel('Log wavelength')
ax.set_ylabel('Flux')
ax.legend()

Of course, that is not a very nice illustration.

## Task: do a better plot for the optical region

Create a nice illustration focusing on the 3000 Å to 10000 Å region. The code below creates a subset of the data that will be useful further on.

In [None]:
i_optical, = np.where((wave > 3000) & (wave < 10000))
flux_opt = flux[i_optical, :]
dflux_opt = dflux[i_optical, :]
wave_opt = wave[i_optical]
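
One possible starting point for the plot, sketched here on synthetic data: spectra of different classes can differ a lot in overall flux level, so scaling each one by its median flux in a reference window and adding a constant offset makes them easier to compare in one panel. The function name `normalise_spectra`, the 5450 Å to 5650 Å window and the `*_demo` arrays are assumptions for illustration and are not part of the notebook. The sketch assumes the same layout as `flux_opt`, that is one spectrum per column.

```python
import numpy as np

def normalise_spectra(flux, wave, lo=5450.0, hi=5650.0):
    """Divide each spectrum (one per column) by its median flux in [lo, hi] Å."""
    window = (wave >= lo) & (wave <= hi)
    scale = np.median(flux[window, :], axis=0)
    return flux / scale

# Synthetic stand-in data: 3 spectra with very different overall flux levels.
wave_demo = np.linspace(3000.0, 10000.0, 701)
shape_demo = 1.0 + 0.3 * np.sin(wave_demo / 900.0)
flux_demo = np.column_stack([shape_demo * amp for amp in (1.0, 50.0, 2000.0)])

flux_norm = normalise_spectra(flux_demo, wave_demo)

# A constant vertical offset per spectrum keeps the curves apart when plotted.
offsets = np.arange(flux_norm.shape[1])
flux_stacked = flux_norm + offsets
```

With the real data, something like `ax.plot(wave_opt, flux_stacked[:, j])` inside a loop over the selected spectra would then draw one offset curve per class.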

## Setting up for PCA

We now need to import the appropriate libraries for PCA.

In [None]:
from sklearn.decomposition import NMF
from sklearn.decomposition import PCA

## Task: carry out a PCA analysis of these spectra

What you should do is:

- Set up the PCA model
- Subtract off the mean spectrum
- Determine how many significant PCA components there are. Can you come up with a physical argument for this number?
- Compare the PCA components to the MK class (`t_overview['numtype']`)
- If you feel adventurous, try to reconstruct a spectrum using a few principal components.

The solution shows a bit more, so it is worth consulting.
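
To make the steps above concrete without giving away the answer for the Pickles spectra, here is a minimal sketch on synthetic data. It uses a plain NumPy SVD rather than sklearn's `PCA` so that each step is visible. All names (`X_demo`, `basis`, `weights`, `k`) and the synthetic data itself are assumptions for illustration only. The fake spectra are built from two underlying shapes plus noise, so two components should carry almost all of the variance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the design matrix: 60 "spectra" of 200 pixels,
# built from 2 underlying shapes plus a small amount of noise.
n_spectra, n_pix = 60, 200
x = np.linspace(0.0, 1.0, n_pix)
basis = np.vstack([np.sin(2 * np.pi * x), x**2])
weights = rng.normal(size=(n_spectra, 2))
X_demo = 5.0 + weights @ basis + 0.01 * rng.normal(size=(n_spectra, n_pix))

# Step 1: subtract the mean spectrum (mean over spectra, per pixel).
mean_spec = X_demo.mean(axis=0)
Xc = X_demo - mean_spec

# Step 2: PCA via SVD; the rows of Vt are the principal components.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(np.round(explained[:4], 4))

# Step 3: reconstruct every spectrum from the first k components only.
k = 2
coeffs = Xc @ Vt[:k].T
X_rec = mean_spec + coeffs @ Vt[:k]
rms = np.sqrt(np.mean((X_demo - X_rec) ** 2))
print(f"RMS residual with k={k}: {rms:.4f}")
```

The number of components with a non-negligible share of the variance is what the third bullet asks about. For real spectra the drop-off is rarely as sharp as in this toy example.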


In [None]:
# Keep the 10 leading components; whiten=False means no whitening is applied
pca = PCA(n_components=10, whiten=False)

In [None]:
# This is the design matrix (note the transpose to adhere to the sklearn convention)
X = flux_opt.T.copy()
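
The layout convention and the mean subtraction can be checked on a small synthetic array. This is a sketch under stated assumptions: `flux_demo`, `X_demo`, `pca_demo` and the array sizes are made up for illustration. It relies on the fact that sklearn's `PCA` centres the data itself during fitting and stores the removed mean in the `mean_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic array laid out like the notebook's flux: (n_wavelengths, n_spectra).
flux_demo = rng.normal(loc=10.0, scale=1.0, size=(50, 20))

# sklearn expects rows = samples (spectra), columns = features (wavelengths).
X_demo = flux_demo.T.copy()

pca_demo = PCA(n_components=5, whiten=False)
coeffs = pca_demo.fit_transform(X_demo)

# PCA centres the data internally and stores the mean spectrum it removed.
mean_spec = X_demo.mean(axis=0)
print(coeffs.shape, pca_demo.components_.shape)
print(np.allclose(pca_demo.mean_, mean_spec))
```

Subtracting the mean spectrum by hand before fitting is therefore not strictly required, but having it available explicitly is useful when reconstructing spectra later.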