# 1. Classification as a prototype of machine learning technique

One typical example of machine learning is the classification problem. This is the process of attributing a "**class**" (or equivalently, a "**label**") to an object of an arbitrary type (e.g. a string, an image, or numbers) in order to catalogue it based on its **properties** (the actual information that we pass to the machine, e.g. the pixel intensities in the case of an image).

<center><img src="images/classification.png" width=500> 
Figure 1.1. Schematic classification in a 2D plot.<br>
(Credit: <a href="https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d"  target="_blank" rel="noopener noreferrer">Supervised vs. Unsupervised Learning, by Devin Soni</a>)</center>

## Types of Classifications

Classification can come in two major flavors, based on the type of intervention by the user:

- **Unsupervised**: The classification is defined "unsupervised" when the user does not provide labels during the training process and the machine learns the definition of each class from the data. This is exactly the case of "Clustering" that we saw in the previous session.  

>    _In practice, the machine learns to cluster objects with similar properties._


- **Supervised**: The classification is defined "supervised" when the user provides a label for each object in the training set. In this case, the idea is that we can train the model to associate the label with some given characteristics of the training data.

>    _In practice, the machine learns to find similarities between objects with the same label._
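The difference between the two flavors can be sketched on a toy dataset. The example below is only an illustration under stated assumptions: the two blobs, their positions, and the choice of `KMeans` and a linear `SVC` are ours and are not part of the stellar sample used later.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data (an assumption for illustration): two well-separated 2D blobs
X, y_true = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                       cluster_std=1.0, random_state=0)

# Unsupervised: the algorithm never sees y_true and invents its own group ids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

# Supervised: the labels are provided during training
svc = SVC(kernel="linear").fit(X, y_true)
predicted = svc.predict(X)

# Cluster ids are arbitrary (0/1 may be swapped with respect to y_true),
# so we compare against both possible assignments
agreement = max(np.mean(cluster_ids == y_true), np.mean(cluster_ids != y_true))
accuracy = np.mean(predicted == y_true)
print(f"Clustering agreement with the true classes: {agreement:.2f}")
print(f"Supervised training accuracy: {accuracy:.2f}")
```

Note that the clustering only returns anonymous group ids, while the supervised model returns the labels it was given.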

## Supervised classification - generative / discriminative classification

One approach to the classification problem involves the recognition of the function describing the density distribution of each class in the parameter space. This type of problem is called **generative classification** because it implies that we can find the distribution from which the data are generated (or better said, sampled).

**Discriminative classification**, instead, includes any method which separates classes by "drawing" a boundary in the parameter space.

In the remainder, we will only focus on the latter type.
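As a rough sketch of the distinction, the snippet below fits one model of each kind on toy data. The dataset and the choice of `GaussianNB` (generative) and a linear `SVC` (discriminative) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Toy data (assumed for illustration): two Gaussian blobs
X, y = make_blobs(n_samples=300, centers=[[0, 0], [4, 4]],
                  cluster_std=1.0, random_state=1)

# Generative: describe each class by a Gaussian distribution in feature space
gen = GaussianNB().fit(X, y)
print("Estimated class means:\n", gen.theta_)

# Discriminative: only the boundary between the classes is learned
disc = SVC(kernel="linear").fit(X, y)
print("Boundary: w =", disc.coef_[0], ", b =", disc.intercept_[0])

gen_acc = gen.score(X, y)
disc_acc = disc.score(X, y)
print(f"Training accuracy: generative {gen_acc:.2f}, discriminative {disc_acc:.2f}")
```

The generative model stores a description of each class (here the means in `theta_`), whereas the discriminative one stores only the separating line.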

## 2. Classification algorithms overview

We start by presenting a set of [**sklearn**](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) algorithms with toy datasets, and then describe some of them. 

This serves as a showcase of the available methods and how they compare. You can easily adapt any of these methods to the following examples or your own problems. 

In [None]:
# # The full code is provided here for completeness, but it is commented out because
# # DecisionBoundaryDisplay is only available in recent scikit-learn releases (1.1 and later)

# ######################################################################################

# # Code source: Gaël Varoquaux
# #              Andreas Müller
# # Modified for documentation by Jaques Grobler
# # License: BSD 3 clause

# import numpy as np
# import matplotlib.pyplot as plt
# from matplotlib.colors import ListedColormap
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.datasets import make_moons, make_circles, make_classification
# from sklearn.neural_network import MLPClassifier
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.svm import SVC
# from sklearn.gaussian_process import GaussianProcessClassifier
# from sklearn.gaussian_process.kernels import RBF
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
# from sklearn.naive_bayes import GaussianNB
# from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# from sklearn.inspection import DecisionBoundaryDisplay

# # Generate toy dataset

# X, y = make_classification(
#     n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1
# )
# rng = np.random.RandomState(2)
# X += 2 * rng.uniform(size=X.shape)
# linearly_separable = (X, y)

# datasets = [
#     make_moons(noise=0.3, random_state=0),
#     make_circles(noise=0.2, factor=0.5, random_state=1),
#     linearly_separable,
# ]

# # Setting up the classification algorithms (parameters)

# names = [
#     "Nearest Neighbors",
#     "Linear SVM",
#     "RBF SVM",
#     "Gaussian Process",
#     "Decision Tree",
#     "Random Forest",
#     "Neural Net",
#     "AdaBoost",
#     "Naive Bayes",
#     "QDA",
# ]

# classifiers = [
#     KNeighborsClassifier(3),
#     SVC(kernel="linear", C=0.025),
#     SVC(gamma=2, C=1),
#     GaussianProcessClassifier(1.0 * RBF(1.0)),
#     DecisionTreeClassifier(max_depth=5),
#     RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
#     MLPClassifier(alpha=1, max_iter=1000),
#     AdaBoostClassifier(),
#     GaussianNB(),
#     QuadraticDiscriminantAnalysis(),
# ]

# # Running and plotting

# figure = plt.figure(figsize=(27, 9))
# i = 1
# # iterate over datasets
# for ds_cnt, ds in enumerate(datasets):
#     # preprocess dataset, split into train and test part
#     X, y = ds
#     X = StandardScaler().fit_transform(X)
#     X_train, X_test, y_train, y_test = train_test_split(
#         X, y, test_size=0.4, random_state=42
#     )

#     x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
#     y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

#     # just plot the dataset first
#     cm = plt.cm.RdBu
#     cm_bright = ListedColormap(["#FF0000", "#0000FF"])
#     ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
#     if ds_cnt == 0:
#         ax.set_title("Input data")
#     # Plot the training points
#     ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
#     # Plot the testing points
#     ax.scatter(
#         X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6, edgecolors="k"
#     )
#     ax.set_xlim(x_min, x_max)
#     ax.set_ylim(y_min, y_max)
#     ax.set_xticks(())
#     ax.set_yticks(())
#     i += 1

#     # iterate over classifiers
#     for name, clf in zip(names, classifiers):
#         ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
#         clf.fit(X_train, y_train)
#         score = clf.score(X_test, y_test)
#         DecisionBoundaryDisplay.from_estimator(
#             clf, X, cmap=cm, alpha=0.8, ax=ax, eps=0.5
#         )

#         # Plot the training points
#         ax.scatter(
#             X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k"
#         )
#         # Plot the testing points
#         ax.scatter(
#             X_test[:, 0],
#             X_test[:, 1],
#             c=y_test,
#             cmap=cm_bright,
#             edgecolors="k",
#             alpha=0.6,
#         )

#         ax.set_xlim(x_min, x_max)
#         ax.set_ylim(y_min, y_max)
#         ax.set_xticks(())
#         ax.set_yticks(())
#         if ds_cnt == 0:
#             ax.set_title(name)
#         ax.text(
#             x_max - 0.3,
#             y_min + 0.3,
#             ("%.2f" % score).lstrip("0"),
#             size=15,
#             horizontalalignment="right",
#         )
#         i += 1

# plt.tight_layout()
# plt.show()

<div style="text-align: center;">
<img src="images/sklearn_classification.png" width=1200> 
Figure 2.1. Overview of the classification algorithms in sklearn
</img>
    </div>

## Quiz time: take a few moments and explore the results - what do you notice?

Write some points here:

- point 1
- point 2




# 2. The sample for our classification example

To properly classify a star we need a spectrum. However, obtaining spectra is time consuming and limited to one source (single-slit spectroscopy) or a few tens of sources (multi-object spectroscopy) at a time. In contrast, imaging in various filters can be done easily and for thousands of sources per image. Photometry at different wavelengths results in a very low-resolution "spectrum".<br>

<center><img src="images/Girardi2002-photometric systems.gif"> 
Figure 2.2. The filter+detector transmission curves for a number of different systems, along with indicative spectra of Vega, the Sun, and an M5 giant. <br>
(Fig. 3 from <a href="https://ui.adsabs.harvard.edu/abs/2002A%26A...391..195G/abstract" target="_blank" rel="noopener noreferrer">Girardi et al. (2002)</a>)</center>

In this example we are using a set of photometric measurements (from optical to mid-IR bands) for a sample of massive evolved stars in the Large Magellanic Cloud (based on these works: [Bonanos et al. (2009) AJ, 138, 1003](https://ui.adsabs.harvard.edu/abs/2009AJ....138.1003B/abstract), [Neugent et al. (2012), ApJ, 749, 177](https://ui.adsabs.harvard.edu/abs/2012ApJ...749..177N/abstract), and [Davies, Crowther & Beasor (2018), MNRAS, 478, 3138](https://ui.adsabs.harvard.edu/abs/2018MNRAS.478.3138D/abstract)). 

<center><img src="images/Massey2013-HRD.png" width=600> 
Figure 2.3. A slightly modified version of Fig. 1 from <a href="https://ui.adsabs.harvard.edu/abs/2013NewAR..57...14M/abstract" target="_blank" rel="noopener noreferrer">Massey et al. (2013)</a>.</center>

Our aim is to use a method that will help us **distinguish different classes** of objects. For our purposes we will use OBA stars (main-sequence objects), OBAe stars (a subcategory of OBA stars with circumstellar disks and emission lines), Wolf-Rayet stars (hot evolved stars with strong stellar winds that actually strip their envelopes), and Yellow and Red supergiants (evolved massive stars). For convenience we will use OBA, OBAe, WR, YSG, and RSG, respectively, as labels.

&#9733; A similar, but more elaborate, implementation is presented in [Maravelias et al. (2022), arXiv: 2203.08125](https://arxiv.org/abs/2203.08125).



## Load and examine data


In [None]:
import numpy as np
import matplotlib.pyplot as plt

&#9755; You do not need ```pandas``` to do everything with files! ```genfromtxt``` is very powerful but has its quirks! 
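To get a feeling for how `genfromtxt` treats missing entries, here is a minimal sketch on an in-memory table. The star names, columns, and values are made up for illustration, and `encoding="utf-8"` is an assumption of this sketch (it is not used in the notebook's own call) so that text columns come back as ordinary strings.

```python
import io
import numpy as np

# Hypothetical miniature catalogue; empty fields mark missing magnitudes
csv_text = """Name,SpT,V,J
star1, RSG ,12.5,9.1
star2,OBA,,13.2
star3,WR,14.0,
"""

toy = np.genfromtxt(io.StringIO(csv_text), dtype=None, delimiter=",",
                    names=True,              # first line holds the column names
                    autostrip=True,          # strip whitespace around entries
                    filling_values=-999.0,   # value used for missing entries
                    encoding="utf-8")

print(toy.dtype.names)   # column names taken from the header
print(toy["SpT"])        # text column, whitespace removed
print(toy["V"])          # missing entry replaced by the fill value
```

Columns are accessed by name, and the fill value makes missing measurements easy to mask later on.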

In [None]:
dfile = "data/LMC_phot_data.csv"
miss_value = -999.0  # when entries are missing

data = np.genfromtxt(dfile, dtype=None, 
                     comments='#', delimiter=',', 
                     filling_values = miss_value, 
                     names=True, autostrip=True)

#examine data
print("Let us see what we have:\n")
print("The column names:")
print(data.dtype.names)
print("-"*25)
print("Let's print the spectral types only:")
print(data['SpT'])
print("-"*25)

HINT: We want to group the different spectral classes found, and for this we use the indices and not the data values directly. In that way we can select all data for the same objects.

In [None]:
from collections import defaultdict

classes = defaultdict(list)

for i in range(0,len(data['SpT'])): 
#    print(i, data['SpT'][i].decode('utf-8'))
    classes[data['SpT'][i].decode('utf-8')].append(i)

#print(classes)
unique_cls = sorted(set(classes.keys()))
print(unique_cls)
print("> SUMMARY of loaded data:")
print("=========================")
for sptype in unique_cls:
    number = len(classes[sptype])
    print(f"{sptype:-<6s}--> {number:>3} stars")

In [None]:
bands = [b for b in data.dtype.names[3:-1] if 'e_' not in b]

def reminder():
    """ 
    A simple function to print all bands
    and classes available.
    """
    print('Available bands to use: ')
    print(','.join(bands))
    print('-'*25)
    print('Available classes to use:')
    print(','.join(unique_cls))

In [None]:
from astropy.table import Table, Column

print(f'Available photometry for: {", ".join(bands)}')     

# Constructing the table for the statistics 
phot_data_col_names = ['Class', 'All'] + [bb for bb in bands]
phot_data_per = Table( names = phot_data_col_names, dtype = ['S3']+['i4']+['f2']*(len(bands)))

for spt in unique_cls:
    indcs = classes[spt]
    starsWbands = defaultdict(list) # per band: stars with a valid measurement
    for star in indcs:
#        print(spt, star)
        for bnd in bands:
            mag = data[star][bnd]
#            print(mag)
            if mag!=miss_value:
                starsWbands[bnd].append(star)    
    row_data_per = [spt, len(indcs)] + [(len(starsWbands[bb])/len(indcs))*100 for bb in bands]
    phot_data_per.add_row ( row_data_per )
    
print("\nNumber of stars per band (in %)\n")
phot_data_per    

&#9755; 36, 45, 58, 89, 24 correspond to the 3.6 μm, 4.5 μm, 5.8 μm, 8.0 μm, and 24 μm bands of *Spitzer*. One common representation of these bands is [3.6], [4.5], [5.8], [8.0], [24] but the use of "[]" and "." is inconvenient, so we drop them. 

## Question: What do you notice here ?


## Visualize data - select features

In [None]:
reminder()

In [None]:
def selmags( band1, band2, cls):
    """
    Function to select sources of a specific
    spectral class (cls) and return the magnitudes
    that correspond to bands 1 and 2.
        """
    # all indeces of the particular class
    cls_indcs = np.asarray( classes[cls] ) 
    # selecting those indeces of the class
    # that do not contain missing values, ie -999
    sel_cls_indcs = np.where( (data[band1][cls_indcs]!=miss_value)
                        & (data[band2][cls_indcs]!=miss_value) )[0]

    sel_indcs = cls_indcs[sel_cls_indcs]
    rem_indcs = len(cls_indcs)-len(sel_indcs)
    print(f'-- {cls}: excluding {rem_indcs} out of {len(cls_indcs)} sources ({rem_indcs/len(cls_indcs)*100:.1f}%)')
    mag1, mag2 = data[band1][sel_indcs], data[band2][sel_indcs]    
    
    return mag1, mag2

In [None]:
fig, ax = plt.subplots(2,2, figsize=(12, 12))

selected_spt = unique_cls              # if you want to print all
#selected_spt = ['RSG', 'OBA', 'WR']     # put your selection here

# plot 1
band1_1 = ''
band1_2 = ''

# plot 2
band2_1 = ''
band2_2 = ''

print('- plot1:')
for s in selected_spt: 
    plt1 = selmags( band1_1, band1_2, s)

    ax[0,0].plot(plt1[0], plt1[1], 'o', label=f'{s}: {len(plt1[0])}')
    ax[0,0].set_xlabel(f'{band1_1}') #'-{band1_2}')
    ax[0,0].set_ylabel(band1_2)
    ax[0,0].invert_yaxis()
    ax[0,0].invert_xaxis()
    ax[0,0].legend()

    ax[0,1].plot(plt1[0]-plt1[1], plt1[1], 'o', label=f'{s}: {len(plt1[0])}')
    ax[0,1].set_xlabel(f'{band1_1}-{band1_2}')
    ax[0,1].set_ylabel(band1_2)
    ax[0,1].invert_yaxis()
    ax[0,1].legend()
    
    
print()
print('- plot2:')        
for s in selected_spt:
    plt2 = selmags( band2_1, band2_2, s)
    ax[1,0].plot(plt2[0], plt2[1], 'o', label=f'{s}: {len(plt2[0])}')
    ax[1,0].set_xlabel(f'{band2_1}') #'-{band2_2}')
    ax[1,0].set_ylabel(band2_2)
    ax[1,0].invert_yaxis()
    ax[1,0].invert_xaxis()    
    ax[1,0].legend()
    
    ax[1,1].plot(plt2[0]-plt2[1], plt2[1], 'o', label=f'{s}: {len(plt2[0])}')
    ax[1,1].set_xlabel(f'{band2_1}-{band2_2}')
    ax[1,1].set_ylabel(band2_2)
    ax[1,1].invert_yaxis()
    ax[1,1].legend()
    

plt.show()

## Play time: Experiment with various combinations and try to answer

### 1. What happens if you start increasing the number of classes to consider?

    
### 2. How does the selection of bands influence which objects are kept?


### 3. Would you prefer to use combinations with fewer or more objects?


# 3. Support Vector Machine (SVM)

Support vector machine (SVM) is a way of choosing a decision boundary between different classes.

The classification boundary is provided by the hyperplane maximizing the distance between the hyperplane itself and the closest point from either class. This distance is called the **margin**. Points on the margins are called **support vectors**.


<center>
<table><tr>
    <td width=400>
        <img src="images/SVM_1.png">
    </td>
    <td width=400>
        <img src="images/SVM_2.png">
    </td>
</tr></table>
    Figure 3.1. Left: Hyperplane (dashed line) separating two classes (_red_ and _green_). Right: The closest points to the hyperplane from each class constitute the "tip" of the support vectors.
</center>

The left panel of Figure 3.1 shows two different classes distributed in a scatter plot according to the variables $x_1$ and $x_2$. The right panel of Figure 3.1 explains the origin of the name support vectors: the closest points _support_ the hyperplanes (solid lines) equally distant from the decision hyperplane (dashed line).

Infinitely many boundaries can separate the two classes. SVM algorithms find the one that maximizes the distance between the supporting hyperplanes.

## Hyperplanes and decision boundary

The supporting hyperplanes (solid lines in Figure 3.1) can be defined as:

> $\mathbf{w}\cdot\mathbf{x} + b = +1$
>
> $\mathbf{w}\cdot\mathbf{x} + b = -1$

where $\mathbf{x}$ is the coordinate vector on the ($x_1$, $x_2$) plane, $\mathbf{w}$ is a 2$\times$1 vector of weights and $b$ a scalar. It turns out that these hyperplanes are separated by a distance $2 / \lVert\mathbf{w}\rVert$. Finding the ideal classification boundary, i.e. the one maximizing this distance, is therefore a problem of minimizing the norm $\lVert\mathbf{w}\rVert$ while keeping each class on its own side of its supporting hyperplane. This is what SVM algorithms do.
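These relations can be checked numerically. The sketch below is an illustration under our own assumptions: the toy blobs are made up, and a very large `C` is used to approximate the strict (hard-margin) case described here.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy, linearly separable data (assumed for illustration)
X, y = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.8, random_state=0)

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the decision hyperplane
b = clf.intercept_[0]     # offset
margin_width = 2.0 / np.linalg.norm(w)

# The support vectors should lie on the hyperplanes w.x + b = +1 or -1
sv_values = clf.decision_function(clf.support_vectors_)

print(f"w = {w}, b = {b:.3f}")
print(f"Distance between the supporting hyperplanes: {margin_width:.3f}")
print(f"w.x + b at the support vectors: {np.round(sv_values, 3)}")
```

Only the few points stored in `support_vectors_` determine the boundary; all the other points lie outside the margin.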

&#9733; For a complete mathematical formulation, consult the [Idiot’s guide to Support vector
machines, by Robert Berwick]( http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf).

## Separable classes (or not)

We cannot always assume that 2 classes are separable without "contamination". That is why SVM algorithms include a tunable parameter ($C$) which penalizes misclassifications.


<div style="text-align: center;">
<img src="images/svm-parameter-c-example.png" width=800> 
Figure 3.2. The effect of the <i>C</i> parameter on misclassifications.<br>
(Credit: <a href="https://learnopencv.com/svm-using-scikit-learn-in-python/"
 target="_blank" rel="noopener noreferrer">SVM: What makes it superior to the Maximal-Margin and Support Vector Classifiers?, by Shivam Sharma</a>)
    </img>
    </div>
    
- Small $C$ &#8594; wide margin &#8594; allows more misclassification <br>
- Large $C$ &#8594; narrow margin &#8594; allows less misclassification    

However, the SVM finds the hyperplane that maximizes the margin, and indirectly minimizes the misclassifications. In other words, SVM is not designed to minimize the contamination _per se_.
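The trend with $C$ can be sketched on two overlapping toy classes. The dataset and the three values of $C$ below are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs (assumed toy data), so no perfect separation exists
X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                  cluster_std=1.2, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])   # margin width 2/||w||
    n_sv = int(clf.n_support_.sum())             # total number of support vectors
    results[C] = (width, n_sv)
    print(f"C = {C:>6}: margin width = {width:.2f}, support vectors = {n_sv}")
```

A smaller $C$ tolerates more points inside the margin, which shows up as a wider margin.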


## Multiple classes

The SVM method can be applied for multiple classes as well.

<center><img src="images/svm_many_classes.png" width=400> 
Figure 3.3. SVM applied to 3 different classes.<br>
</center>
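As a small sketch of the multi-class case, the example below trains a single `SVC` on four toy classes (the data are made up for illustration). In scikit-learn, `SVC` handles multiple classes internally with a one-vs-one scheme, i.e. one binary SVM per pair of classes.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Four toy classes placed on the corners of a square (assumed data)
centers = [[0, 0], [5, 0], [0, 5], [5, 5]]
X, y = make_blobs(n_samples=200, centers=centers, cluster_std=0.7, random_state=0)

# decision_function_shape="ovo" exposes the individual pairwise classifiers
clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

pair_scores = clf.decision_function(X[:1])
center_pred = clf.predict(centers)

print("Classes:", clf.classes_)
print("Pairwise decision values for one object:", pair_scores.shape)  # 4*3/2 pairs
print("Predicted class at each blob center:", center_pred)
```

The final label is the class that wins the most pairwise comparisons.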

## Multiple dimensions

If our sample is characterized by three parameters (X, Y, Z), then the scatter plot has 3 dimensions and the boundary between the classes is a plane. Because the method generalizes to N dimensions, the boundary is in general called a *hyperplane*.

<img src="images/svm_3d.png" width=400>
<center>
    Figure 3.4: Support vector machine applied for 3-D features and three classes.
</center>
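Nothing changes in the code when more features are added. The sketch below, based on made-up 3-D toy data, shows that the fitted boundary is then described by three weights and one offset, i.e. a plane.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data with three features per object (assumed for illustration)
X, y = make_blobs(n_samples=100, centers=[[0, 0, 0], [4, 4, 4]],
                  cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear").fit(X, y)
w = clf.coef_[0]        # one weight per feature
b = clf.intercept_[0]

# The plane w.x + b = 0 is the decision boundary; the sign gives the class
manual = X @ w + b
manual_pred = (manual > 0).astype(int)

print(f"Plane: {w[0]:.2f}*X + {w[1]:.2f}*Y + {w[2]:.2f}*Z + {b:.2f} = 0")
```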

## Non-linear boundaries

Sometimes, linear boundaries may not be optimal and a non-linear SVM should be used instead. The left panel of Figure 3.5 shows a 2D scatter plot of two different classes (e.g. red and green stars with different radii and temperatures) which cannot be linearly separated.

In order to find non-linear boundaries we can tackle the problem in a higher dimensional space. We use a process called **kernelization**, which consists in using a kernel function to attribute to our data a value in the additional dimension. Then, we draw the decision hyperplane in this higher dimensional space.

The central panel of Figure 3.5 shows that once the 2D data are mapped to a 3D space by attributing a $z$ value through a Gaussian-like function, the classes are easily separable by a plane in 3D. Projecting the plane back to 2D, we obtain the non-linear boundary (Figure 3.5, right panel).

<img src="images/kernel.png" width=800>
<center>
    Figure 3.5. When no linear boundary can be used, the SVM method can be applied by using a kernel.
</center>
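The idea can be sketched with concentric toy classes. In this illustration (our own assumption, not the data behind the figure) we add by hand a third coordinate $z = e^{-\gamma r^2}$, where $r$ is the distance from the origin, and compare a linear SVM in 2D, a linear SVM in the lifted 3D space, and an RBF-kernel SVM in the original 2D space.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes: not separable by a straight line (assumed toy data)
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# 1) Linear boundary in the original 2D space
lin_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# 2) Explicit lift to 3D with a Gaussian-like function of the radius
gamma = 1.0   # arbitrary value chosen for this sketch
z = np.exp(-gamma * np.sum(X**2, axis=1))
X_lifted = np.column_stack([X, z])
lifted_acc = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

# 3) Kernel SVM: the mapping is handled implicitly by the kernel
rbf_acc = SVC(kernel="rbf", gamma=gamma).fit(X, y).score(X, y)

print(f"Linear SVM in 2D:        {lin_acc:.2f}")
print(f"Linear SVM in lifted 3D: {lifted_acc:.2f}")
print(f"RBF-kernel SVM in 2D:    {rbf_acc:.2f}")
```

In practice the explicit lift is not needed: choosing a kernel is enough, and the algorithm never has to construct the additional dimension.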

## Choosing the kernel function

Useful kernel functions must satisfy specific mathematical conditions, so in practice only a few are used. In the example of Figure 3.5, the Gaussian Radial Basis Function is used:

> $K(\mathbf{x},\mathbf{y}) = e^{-\gamma \lVert\mathbf{x}-\mathbf{y}\rVert^2}$

where $\gamma$ is a hyperparameter that has to be tuned (in our example we use an arbitrary value, but in principle we should choose it with cross-validation methods).
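A minimal sketch of both points follows: the kernel evaluated by hand and compared with scikit-learn's `rbf_kernel`, and a small cross-validated search over $\gamma$ with `GridSearchCV`. The random points, the toy dataset, and the grid of values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# --- The kernel itself: K(x, y) = exp(-gamma * ||x - y||^2) ---
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))        # five arbitrary 2D points
gamma = 0.5
sq_dists = np.sum((A[:, None, :] - A[None, :, :])**2, axis=-1)
K_manual = np.exp(-gamma * sq_dists)
K_sklearn = rbf_kernel(A, gamma=gamma)
print("Manual and sklearn kernels agree:", np.allclose(K_manual, K_sklearn))

# --- Choosing gamma by cross-validation on a toy dataset ---
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
gamma_grid = [0.01, 0.1, 1, 10]    # arbitrary grid for this sketch
search = GridSearchCV(SVC(kernel="rbf"), {"gamma": gamma_grid}, cv=5)
search.fit(X, y)
print("Best gamma:", search.best_params_["gamma"])
print(f"Mean cross-validated accuracy: {search.best_score_:.2f}")
```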

## Final remarks on SVM

**Pros**
* Good at dealing with high dimensional data
* Works well on small data sets

**Cons**
* Picking the right kernel and parameters can be computationally intensive
* It does not minimize contamination directly, so overlapping classes remain partly misclassified

&#9733; For further information on SVM, consult [Support Vector Machine - Classification, by Saed Sayad](http://www.saedsayad.com/support_vector_machine.htm).

# 4. Application 1: SVM in practice

## The binary problem

In this case we examine a binary classification problem where we select one class (or several that are grouped into a single one) and treat the rest as contaminants. The purpose is to check whether we can efficiently separate these two classes.

In [None]:
def process_data( bands2use, binary_classes2use  ):
    """
    Process input data to return arrays 
    of magnitudes and (consecutive) colors
    based on the input bands (bands2use).

    If binary_classes2use is non-empty, labels are prepared
    for binary classification: 'SEL' for the listed classes and
    'CON' for the rest. Otherwise the original classes are kept.
    
    """
    pd_ml_data_mags = []   # working with magnitudes directly
    pd_ml_data_clrs = []   # taking color terms, i.e. mag1-mag2
    pd_ml_labels    = []
    pd_ml_objects   = []

    print(f'# stars with mags in: {",".join([bb for bb in bands2use])}')
    print("=========================")
    print("Type    initial    final ")
    print("-------------------------")
    init = 0 # initial total number of stars (added after each iteration)

    for sptype in unique_cls:
        indcs = classes[sptype]
        kept = []
        init += len(indcs)
        for star in indcs:
            mag_list = list(data[star][bands2use])
            # rejecting stars with missing values
            if miss_value in mag_list:
                #print('REJECTING!!! <',data[star])
                continue
            else:
                # creating the magnitude list
                mag = [ i for i in mag_list ] #data[star][bands_selected] ]

                # creating the color term (index)
                clr = [mag[i]-mag[i+1] for i in range(len(mag)-1)]

                pd_ml_data_clrs.append(clr)
                pd_ml_data_mags.append(mag)
                pd_ml_objects.append(data[star]['Name'])
                kept.append(sptype)           

                # selecting class(es) to examine for binary classifier
                if len(binary_classes2use)!=0:
                    if sptype in binary_classes2use:
        #                print(f'. keeping {sptype}')
                        label_sptype = 'SEL'
                    else: 
        #                print(f'. not considering {sptype}')
                        label_sptype = 'CON'
                    pd_ml_labels.append(label_sptype)
                else:
                    pd_ml_labels.append(sptype)

        print(f'{sptype:<4}  {len(indcs):>9} {len(kept):>8}')
    print('-'*24)
    print(f'TOTAL:  {init:>7}  {len(pd_ml_data_mags):>7}') 
    if len(binary_classes2use)!=0:
        print('='*24)
        print(f'classifying:  {len(pd_ml_labels)-pd_ml_labels.count("CON"):>10}') 
        print(f'contaminants:  {pd_ml_labels.count("CON"):>9}') 


    pd_ml_data_mags = np.asarray(pd_ml_data_mags)
    pd_ml_data_clrs = np.asarray(pd_ml_data_clrs)
    pd_ml_objects   = np.asarray(pd_ml_objects)
    pd_ml_labels    = np.asarray(pd_ml_labels)
          
    return pd_ml_data_mags, pd_ml_data_clrs, pd_ml_objects, pd_ml_labels

In [None]:
reminder()

Select here the class(es) you would like to distinguish from the rest, along with the bands to use.

In [None]:
class2keep = ['RSG']
# Select the bands you want to use here:
bands_selected = [] 

In [None]:
ml_data_mags, ml_data_clrs, ml_objects, ml_labels = process_data( bands_selected, class2keep)  

NOTE: the process_data() function examines and keeps only the sources with values across all bands. In other words, we remove sources with missing values (according to the bands selected). 

Let's print the labels to see what they look like.
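The two operations performed by the function (rejecting incomplete sources and building consecutive colors) can be sketched in vectorized form. The magnitudes below are hypothetical values chosen for illustration; only the `-999` sentinel follows the notebook.

```python
import numpy as np

miss_value = -999.0   # same sentinel for missing entries as in the notebook

# Hypothetical magnitudes: rows are stars, columns are three bands
mags = np.array([[12.0, 11.5, 11.0],
                 [13.0, miss_value, 12.0],
                 [10.0,  9.0,  8.5]])

# Keep only stars with a measurement in every band
complete = np.all(mags != miss_value, axis=1)
mags_kept = mags[complete]

# Consecutive colors: band1-band2, band2-band3, ...
colors = mags_kept[:, :-1] - mags_kept[:, 1:]

print(f"Kept {complete.sum()} out of {len(mags)} stars")
print(colors)
```

With N bands we obtain N-1 colors per star, which is why the color array has one column less than the magnitude array.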

In [None]:
print(ml_labels)

NOTE: If you have more than 2 bands selected the following plot will use the first two. Modify accordingly to plot other combinations.

In [None]:
fig = plt.figure(figsize=(12,10))

conts = np.where( ml_labels=='CON' )[0]
clasf = np.where( ml_labels!='CON' )[0]

plt.plot( ml_data_mags[conts][:,0], ml_data_mags[conts][:,1], 'o', 
             label='Contaminants')
plt.plot( ml_data_mags[clasf][:,0], ml_data_mags[clasf][:,1], '*', 
             label=f'Selected ({"+".join(class2keep)})')
plt.gca().invert_yaxis()

plt.xlabel(f'{bands_selected[0]}') #'-{bands_selected[1]}')
plt.ylabel(bands_selected[1])
plt.legend()
plt.show()


## Introducing train-test split

In supervised approaches we want to "teach" the algorithms what they need to learn before they start making predictions. In this case we want the SVM to identify the common properties of the two subgroups (SEL, including all selected classes, and CON, the contaminants). However, if we provide all the data, the algorithm may learn them "by heart" and fit them perfectly (this is called **overfitting**), and when new data appear it will probably misclassify them. 

To address this, and to have a way to estimate the performance of the algorithms, a standard **train-test split** is performed. As much data as possible should enter the training sample, with typical fractions being 70-80%. What is left is treated as a test sample, i.e. data that are not used to train the model. 

<div style="text-align: center;">
<img src="images/train-test-split.png" width=600> 
Figure 4.1. Splitting the sample into training and test sets. <br>
(Credit: G. Maravelias)
    </img>
    </div>

A better approach is to split the whole sample into **train**, **validation**, and **test** sets. In this way the train set defines the model's parameters, the validation set is used to choose the *hyperparameters* (those parameters whose values control the learning process), and the test set is used only to evaluate the final performance. We will see this and more advanced techniques to estimate performance in the next sessions (ML_Practices).
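A rough sketch of this three-way split is given below, on a made-up toy dataset and with an arbitrary set of $\gamma$ values (all assumptions for illustration). It also shows overfitting in action: a very large $\gamma$ fits the training set almost perfectly but performs poorly on data it has not seen.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy toy data (assumed for illustration)
X, y = make_moons(n_samples=500, noise=0.35, random_state=0)

# Two consecutive splits: 60% train, 20% validation, 20% test
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                            random_state=0)

# Train with each hyperparameter value; compare train and validation accuracy
scores = {}
for gamma in (0.1, 1.0, 1000.0):
    model = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X_tr, y_tr)
    scores[gamma] = (model.score(X_tr, y_tr), model.score(X_val, y_val))
    print(f"gamma = {gamma:>6}: train = {scores[gamma][0]:.2f}, "
          f"validation = {scores[gamma][1]:.2f}")

# Pick the value with the best validation accuracy; touch the test set only once
best_gamma = max(scores, key=lambda g: scores[g][1])
final = SVC(kernel="rbf", gamma=best_gamma, C=10.0).fit(X_tr, y_tr)
test_acc = final.score(X_te, y_te)
print(f"Selected gamma = {best_gamma}, test accuracy = {test_acc:.2f}")
```

The test accuracy is an honest estimate only because the test set played no role in training or in the choice of $\gamma$.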



In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(ml_data_mags, ml_labels, 
                        test_size=0.3) #, random_state=42) 

print(f'- From {len(ml_objects)} sources:')
print(f'   {len(X_train)} (training)')
print(f'   {len(X_test)} (test)') 
print()
print(f'Test labels: {y_test}')


Now, let's use the classifier to fit our training set (X_train) and predict the classes of the test examples (X_test). We will print some metrics to check the performance.

In [None]:
from sklearn import metrics

clf = SVC(kernel='linear') 
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
#print(y_pred)

print(f"Classification report:\n\n {metrics.classification_report(y_test, y_pred)}") 
print(f"Confusion matrix: \n\n {metrics.confusion_matrix(y_test, y_pred)}")

## Model evaluation metrics

There are a number of metrics for assessing the performance of a classifier. We start by introducing the idea of the **confusion matrix**. 

<center><img src="images/confusion_matrix-mod.png" width=400> 
Figure 4.2. Confusion matrix for classification.<br>
(Credit: <a href="https://towardsdatascience.com/precision-vs-recall-386cf9f89488"  target="_blank" rel="noopener noreferrer">Precision vs Recall by Shruti Saxena</a>)</center>


Defining some metrics:

$$ \rm{Precision} = \frac{\rm{True~Positives}}{\rm{Predicted~Positives}} = \frac{\rm{True~Positives}}{\rm{True~Positives\,+\,False~Positives}} $$ 

$$ \rm{Recall} = \frac{\rm{True~Positives}}{\rm{Actual~Positives}} = \frac{\rm{True~Positives}}{\rm{True~Positives\,+\,False~Negatives}} $$ 

$$ \text{F1-score} = 2 \times \frac{\rm{Precision}\times\rm{Recall}}{\rm{Precision}+\rm{Recall}}$$

$$ \rm{Accuracy} = \frac{\rm{True~Positives}\,+\,\rm{True~Negatives}}{\rm{Total}} $$ 

<br>
<div style="text-align: center;">
Support: number of test objects per class<br><br>
Macro avg: unweighted mean of the per-class values<br><br>
Weighted avg: mean of the per-class values, weighted by the support of each class
</div>

---

Note 1: You may also encounter the terms _sensitivity_ and _specificity_, which correspond to the recall of the positive and the negative class, respectively, in binary problems. 

Note 2: In astrophysics we use the terms _completeness_ and _contamination_ (see [Classification, by Andy Connolly](http://connolly.github.io/introAstroML/blog/classification.html)):

$$ \rm{completeness} = \frac{\rm{True~Positives}}{\rm{All~real~Positives}} = \frac{\rm{True~Positives}}{\rm{True~Positives\,+\,False~Negatives}} = \rm{recall} $$

$$ \rm{contamination} = \frac{\rm{False~Positives}}{\rm{All~detected~Positives}} = \frac{\rm{False~Positives}}{\rm{True~Positives\,+\,False~Positives}} $$
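The definitions can be verified by hand on a tiny example. The ten true and predicted labels below are made up for illustration, with `SEL` treated as the positive class.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for ten objects
y_true = np.array(["SEL"] * 5 + ["CON"] * 5)
y_pred = np.array(["SEL", "SEL", "SEL", "SEL", "CON",
                   "SEL", "SEL", "CON", "CON", "CON"])

# With labels=["SEL", "CON"] the first row/column is the positive class
cm = metrics.confusion_matrix(y_true, y_pred, labels=["SEL", "CON"])
tp, fn = cm[0]   # true SEL predicted as SEL / as CON
fp, tn = cm[1]   # true CON predicted as SEL / as CON

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # identical to completeness
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()
contamination = fp / (tp + fp)     # equals 1 - precision

print(f"precision = {precision:.3f}, recall/completeness = {recall:.3f}")
print(f"F1 = {f1:.3f}, accuracy = {accuracy:.3f}, contamination = {contamination:.3f}")

# Cross-check against the sklearn implementations
sk_precision = metrics.precision_score(y_true, y_pred, pos_label="SEL")
sk_recall = metrics.recall_score(y_true, y_pred, pos_label="SEL")
sk_f1 = metrics.f1_score(y_true, y_pred, pos_label="SEL")
```

Passing `labels` explicitly fixes the order of the rows and columns; without it, scikit-learn sorts the labels alphabetically.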



A prettier presentation of the confusion matrix printed above for our SVM classifier...

In [None]:
def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions
                  
                  
    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap, alpha=0.5)
#    plt.title(title)
    cbar = plt.colorbar()
    cbar.set_label('# sources', fontsize=16)
    cbar.ax.tick_params(labelsize=16) # (fontsize=15)
   

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45, fontsize=15)
        plt.yticks(tick_marks, target_names, fontsize=15)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.3f}".format(cm[i, j]),
                     horizontalalignment="center", color="black", fontsize=14 )
                     #color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center", color="black") 
#                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label', fontsize=16)
    plt.xlabel('Predicted label (accuracy={:0.2f})'.format(accuracy, misclass), fontsize=16)
    plt.show()


In [None]:
plot_confusion_matrix( metrics.confusion_matrix( y_test, y_pred),
                      ['SEL','CON'],
                      title='Confusion matrix', cmap='BuPu', # for more options see: https://matplotlib.org/stable/tutorials/colors/colormaps.html
                      normalize=False  # True shows row-normalized fractions, False raw counts
                      ) # YlOrBr


## Question: What happens if we start adding more classes to the selected one?


## Question: What happens when we start changing the $C$ parameter? 

HINT: the default value is 1, so start increasing it. 


## The multi-class problem

Now we approach the same problem as a multi-class one, i.e. we are using the SVM as a classifier that can handle all classes simultaneously. 

In [None]:
reminder()

In [None]:
bands_selected = [] 

In [None]:
ml_data_mags, ml_data_clrs, ml_objects, ml_labels = process_data( bands_selected,[])  

Now if we print the labels again, we notice that the array is totally different and the multiclass output is evident.

In [None]:
print(ml_labels)

## Splitting into train and test sets for multi-class

We again use the train-test split approach, but now considering all classes.

In [None]:
indices = np.arange(len(ml_labels))
X_train, X_test, y_train, y_test = train_test_split(ml_data_mags, ml_labels,
#                                shuffle=True, stratify=ml_labels, 
                                test_size=0.3 ) #, random_state=42) 

print(f'> From {len(ml_objects)} sources we use {len(X_train)} for training and {len(X_test)} for testing.') 

print('\nStatistics per class:')
k = 0
for c in unique_cls:
    items_train = np.where( y_train==c )[0]
    items_test  = np.where( y_test==c )[0]
    items_total = np.where( ml_labels==c )[0]
    print(f'> For {c} there are {len(items_total)} sources split in {len(items_train)} (train) and {len(items_test)} (test) samples')



## Play time: Run the train-test split a few times as it is. Do you notice anything strange?

HINT: try reducing `test_size` to make the effect clearer.


## How can we correct for the effect in the previous question?

HINT: Check the documentation and find out which parameters help.


### Important take-away 

> **The sklearn documentation is your friend!**

In [None]:
clf2 = SVC(kernel='linear') 
clf2.fit(X_train, y_train)
y_pred = clf2.predict(X_test)
#print(y_pred)

#confmatrix = metrics.confusion_matrix( y_test, y_pred)
print(f"Classification report:\n\n {metrics.classification_report(y_test, y_pred)}") 
print(f"Confusion matrix: \n\n {metrics.confusion_matrix(y_test, y_pred)}")

plot_confusion_matrix( metrics.confusion_matrix( y_test, y_pred),
                      unique_cls,
                      title='Confusion matrix', cmap='BuPu', # for more options see: https://matplotlib.org/stable/tutorials/colors/colormaps.html
                      normalize=False  # True shows row-normalized fractions, False raw counts
                      ) # YlOrBr

## Question: How does the result change with respect to the binary case?


## Question: How does the result change with the kernel?

HINT: check sklearn.svm.SVC


## Take-away point

> Accuracy is **not** the best metric to use when we have imbalanced datasets. 
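
To make this take-away concrete, here is a minimal sketch using only the Python standard library. The class names, the 95/5 split, and the "lazy" classifier that always predicts the majority class are all made up for illustration; they are not taken from the dataset used in this notebook.

```python
# Toy imbalanced test set: 95 sources of class 'A', 5 of class 'B' (made-up numbers).
y_true = ['A'] * 95 + ['B'] * 5

# A "lazy" classifier that always predicts the majority class.
y_pred = ['A'] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of all sources that were classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly classified members / all true members."""
    recalls = {}
    for cls in sorted(set(y_true)):
        members = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls[cls] = sum(p == cls for p in members) / len(members)
    return recalls

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

print(f"accuracy          = {accuracy(y_true, y_pred):.2f}")
print(f"recall per class  = {per_class_recall(y_true, y_pred)}")
print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.2f}")
```

The accuracy looks excellent although the classifier never recovers a single member of the rare class, which is exactly what the per-class recall and the balanced accuracy expose. In practice the per-class recall is already part of `metrics.classification_report`, and sklearn also offers `metrics.balanced_accuracy_score`.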

# 5. Random Forest

## Decision Tree

A **Decision Tree** (**DT**) is simply a top-to-bottom tree-like structure where each node corresponds to a question (or, more generally, a condition on a set of features) that splits the objects into two groups, to the left and to the right of the node. The drawback of a decision tree is that it learns the training set extremely well: DTs tend to overfit the data and therefore cannot predict new data very accurately.

<center><img src="images/DecisionTree.jpg" width=400> 
Figure 5.1. Quick introduction to Decision Trees - how to fulfill an everyday need.<br>
(Credit: G. Maravelias)</center>
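
The "one question per node" idea can be sketched with a hand-built tree. The feature names (`color`, `magnitude`), the thresholds and the classes `A`, `B`, `C` below are hypothetical and chosen only for illustration; a real DT learns its questions from the training set.

```python
# A hand-built tree: internal nodes ask "is feature <= threshold?", leaves hold a class.
tree = {
    'feature': 'color', 'threshold': 0.5,
    'left':  {'leaf': 'A'},
    'right': {'feature': 'magnitude', 'threshold': 18.0,
              'left':  {'leaf': 'B'},
              'right': {'leaf': 'C'}},
}

def predict(node, obj):
    """Walk from the root to a leaf, going left when the answer is 'yes'."""
    while 'leaf' not in node:
        branch = 'left' if obj[node['feature']] <= node['threshold'] else 'right'
        node = node[branch]
    return node['leaf']

for obj in [{'color': 0.2, 'magnitude': 20.0},
            {'color': 0.9, 'magnitude': 17.0},
            {'color': 0.9, 'magnitude': 19.5}]:
    print(obj, '->', predict(tree, obj))
```

A tree that keeps adding such questions until every training object sits alone in its own leaf reproduces the training set perfectly, which is the overfitting behavior described above.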
 

##  Random Forests

**Random Forests** (**RF**) or Random Decision Trees ([Breiman (2001), Machine Learning, 45, 5](https://doi.org/10.1023/A:1010933404324)) generalize DTs by combining a multitude of decision trees. When an RF is used as a classification method, the final output for each input datum is the mode of the classes predicted by the individual trees.


<center><img src="images/RandomForests.jpg" width=800> 
Figure 5.2. Schematic description of the Random Forest classifier.<br>
(Credit: G. Maravelias)</center>


RF creates a large number of DTs through random selection of a subset of the training set as well as a random selection of features. This randomness reduces the correlation between the different DTs. Since the DTs have different conditions in their nodes and different overall structures, this diversity yields robust overall predictions.
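
The two sources of randomness can be sketched as follows, using only the standard library. This is a simplified illustration under stated assumptions: the feature names are hypothetical, and one feature subset is drawn per tree, whereas scikit-learn's `RandomForestClassifier` draws a fresh subset of candidate features at every split (controlled by `max_features`).

```python
import random

def draw_tree_inputs(n_samples, feature_names, n_features_per_tree, rng):
    """Pick the rows and the columns that one tree of the forest would see."""
    # Bootstrap: draw n_samples row indices WITH replacement.
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Random feature subset: draw columns WITHOUT replacement.
    cols = rng.sample(feature_names, n_features_per_tree)
    return rows, cols

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
features = ['band_1', 'band_2', 'band_3', 'band_4']  # hypothetical feature names

for tree_id in range(3):
    rows, cols = draw_tree_inputs(n_samples=10, feature_names=features,
                                  n_features_per_tree=2, rng=rng)
    print(f"tree {tree_id}: rows={sorted(rows)}, features={cols}")
```

Because the rows are drawn with replacement, some sources typically appear more than once in a given tree's sample while others are left out, so each tree sees a slightly different version of the training set.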

Once the RF has been trained, the data of an **unlabeled** source (to be classified) are fed into all DTs of the forest. According to its properties and the nodes in each DT, it follows a specific path which leads to a given class. The final output of the RF (the prediction) is an aggregation of all DTs by means of a majority vote.

The fact that RF combines the predictions of a number of individual trees makes it an **ensemble** method.
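
The aggregation step is easy to sketch. The list of per-tree predictions below is hypothetical (a 7-tree forest voting on one unlabeled source, with placeholder class names).

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class among the trees, with its vote fraction."""
    votes = Counter(tree_predictions)
    winner, n_votes = votes.most_common(1)[0]
    return winner, n_votes / len(tree_predictions)

# Hypothetical outputs of a 7-tree forest for one unlabeled source.
tree_predictions = ['A', 'A', 'B', 'A', 'C', 'A', 'B']

label, fraction = majority_vote(tree_predictions)
print(f"predicted class: {label} ({fraction:.0%} of the trees)")
```

The vote fraction gives a rough idea of how confident the forest is. Note that this is a sketch of hard voting; scikit-learn's `RandomForestClassifier` instead averages the class probabilities predicted by the individual trees, which usually, but not always, leads to the same winner.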

&#9733; [Reis, Baron, & Shahaf (2019), AJ, 157, 16](https://ui.adsabs.harvard.edu/abs/2019ascl.soft03009R/abstract) provide an excellent description of RFs (for a two-class problem) and present a probabilistic RF method which takes into account the uncertainties on the data and their labels.

## Final remarks on Random Forests

In machine learning most of the effort is actually spent on the **sample** selection and, most importantly, on the selection of the **features** used for the classification (feature engineering).

RFs partially overcome the latter problem by training each DT on a different subset of features, so that the algorithm learns to recognize the features that best differentiate the objects.

**PROS**
- No need for scaling or transformation of the initial data.
- Implicit feature selection.
- Suitable for large datasets with many features.


**CONS**
- Not easily interpretable.
- Hyperparameters need careful tuning for high accuracy.

# 6. Application 2: Random Forests in practice

Using the same dataset, we are now approaching the same problem with the Random Forest (both as a binary and as a multi-class classifier).

---

**TASK 1: Complete the train-test split and select magnitudes to work with.**

**TASK 2: Find the proper function for RF and make predictions**

In [None]:
reminder()

In [None]:
class2keep_RF=[]  # add any class if you want to use RF as binary classifier, or keep it []
bands_selected_RF = [] #, '36um', '58um'] 

In [None]:
ml_data_mags, ml_data_clrs, ml_objects, ml_labels = process_data( bands_selected_RF, class2keep_RF)  

In [None]:
# splitting data on magnitudes or colors

X_train, X_test, y_train, y_test = train_test_split(
                                test_size=0.3) 

print(f'> From {len(ml_objects)} sources we use {len(X_train)} for training and {len(X_test)} for testing.') 

# check if in binary or multi-label mode
if 'CON' in ml_labels: # binary
    confmat_classes = class2keep_RF+['CON']
else:
    confmat_classes = unique_cls

    print('\nStatistics per class:')
    k = 0
    for c in unique_cls:
        items_train = np.where( y_train==c )[0]
        items_test  = np.where( y_test==c )[0]
        items_total = np.where( ml_labels==c )[0]
        print(f'> For {c} there are {len(items_total)} sources split in {len(items_train)} (train) and {len(items_test)} (test) samples')


In [None]:
from sklearn... import ...

clfrf = ...

y_pred = 

print(f"Classification report:\n\n {metrics.classification_report(y_test, y_pred)}") 
print(f"Confusion matrix: \n\n {metrics.confusion_matrix(y_test, y_pred)}")

plot_confusion_matrix( metrics.confusion_matrix( y_test, y_pred),
                      confmat_classes,
                      title='Confusion matrix', cmap='BuPu', # for more options see: https://matplotlib.org/stable/tutorials/colors/colormaps.html
                      normalize=False  # True shows row-normalized fractions, False raw counts
                      ) # YlOrBr

## Question: Why do the results change when we re-run it?


## Question: What is the difference in the accuracy of the algorithm if we use the colors instead of the magnitudes? Why would we do that?


# 7. k-Nearest Neighbors (KNN) classification

k-NN relies on a nearest-neighbor search, which by itself is unsupervised: it only identifies the training points closest to a given location. Combined with the labels of the training set, we can then use it to perform classification. In other words, we can attribute any new point to the class which dominates its surroundings.

The problem is then how to define the "neighborhood" of a point. The trivial solution would be to set a fixed radius. The issue then becomes its size: if too small, we **will not find neighbors** for "satellite" points at the edge of a class cluster; if too large, we **will lose resolution** in dense parts, effectively throwing away information. Therefore, ideally we would like to have a *variable bandwidth* selection threshold.
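
The trade-off can be seen on a toy data set. The sketch below uses made-up 2D points (a dense clump plus one isolated "satellite") and contrasts a fixed-radius neighborhood with simply taking the $k$ closest points, which is one way of obtaining a variable bandwidth. The radii and the value of $k$ are arbitrary choices for illustration.

```python
import math

# Toy 2D training set: a dense clump near the origin and one isolated "satellite" point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]

def neighbors_within_radius(x, points, radius):
    """All training points within a fixed radius of x."""
    return [p for p in points if math.dist(x, p) <= radius]

def k_nearest(x, points, k):
    """The k training points closest to x, whatever their distance."""
    return sorted(points, key=lambda p: math.dist(x, p))[:k]

query_dense = (0.05, 0.05)   # inside the clump
query_sparse = (2.0, 2.0)    # far from most of the data

for name, x in [('dense', query_dense), ('sparse', query_sparse)]:
    for radius in (0.5, 3.0):
        n = len(neighbors_within_radius(x, points, radius))
        print(f"{name:6s} query, radius={radius}: {n} neighbors")
    print(f"{name:6s} query, k=3: {len(k_nearest(x, points, 3))} neighbors")
```

With the small radius the sparse query finds no neighbors at all, while with the large radius it swallows the entire training set. Asking for the $k$ closest points always returns exactly $k$ neighbors, with an effective radius that shrinks in dense regions and grows in sparse ones.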

> One solution is to use a local average of the labels of the $k$ nearest neighbors:
>
> $y = \frac{1}{k} \sum_{x_i \in N_k(x)} y_{i}$
>
> where $N_k(x)$ is the set of the $k$ training points nearest to $x$.

In this way the neighborhood _is not_ defined by a fixed distance in the parameter space, but adapts to the local density of the points.
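
For a two-class problem with labels encoded as 0 and 1, the local average is simply the fraction of the $k$ nearest neighbors that belong to class 1. A minimal from-scratch sketch, assuming Euclidean distances and made-up training points (in practice we would use the sklearn implementation):

```python
import math

def knn_local_average(x, X_train, y_train, k):
    """Mean label of the k training points nearest to x (labels must be 0 or 1)."""
    # Sort the training indices by Euclidean distance to x and keep the first k.
    order = sorted(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k

# Made-up training set: 0 = "blue", 1 = "red".
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (4.0, 4.0), (4.5, 5.0), (5.0, 4.0)]
y_train = [0, 0, 0, 1, 1, 1]

x_new = (3.5, 3.5)
y = knn_local_average(x_new, X_train, y_train, k=5)
predicted = 1 if y > 0.5 else 0
print(f"local average y = {y:.2f} -> predicted label {predicted}")
```

Thresholding the average at 0.5 is equivalent to a majority vote among the neighbors, and the value of $y$ itself can be read as a crude class probability.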

Let's see a 2D example. We have two parameters and training data that are classified as being *red* or *blue*. The question is: how do we classify a new (_i.e. not part of the training set_) point? The following images are taken from [MNIST analysis using KNN, by Gerardo Durán  / ImportQ](https://importq.wordpress.com/2017/11/24/mnist-analysis-using-knn/) (we edited the first one).

<table><tr>
    <td width=400>
        <img src="images/knn_neigh_initial.jpg">
        <center>Figure 7.1.a. Training data already possessing a red or blue label, and an arbitrary new point to be classified.</center>
    </td>    
    <td width=400>
        <img src="images/knn_neigh.gif">
        <center>Figure 7.1.b. Classification using majority votes of $k$ neighbors, for different values of $k$.</center>
    </td>
    <td width=400>
        <img src="images/knn_neigh_mult.gif">
        <center>Figure 7.1.c. For a fixed $k$, the model can be thought of as a function of the location in the parameter space. Note that the appearing dots are not part of the training set. Instead, they represent the predicted classification the new point would receive if it fell on that position.</center>
    </td>
</tr></table>

Panels a and b already suggest that the k-NN classification is affected by the choice of the **hyperparameter** $k$.
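
The dependence on $k$ can be reproduced with a handful of made-up points: a single *blue* point sitting among three *red* ones, plus a more distant group of *blue* points. The sketch below is a from-scratch majority-vote classifier with Euclidean distances, written only for illustration.

```python
import math
from collections import Counter

def knn_predict(x, X_train, y_train, k):
    """Majority vote among the k training points nearest to x."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up training set: one blue point among reds, and a distant blue group.
X_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0),
           (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
y_train = ['blue', 'red', 'red', 'red', 'blue', 'blue', 'blue']

x_new = (0.2, 0.2)
for k in (1, 3, 5, 7):  # odd values avoid ties in a two-class vote
    print(f"k={k}: {knn_predict(x_new, X_train, y_train, k)}")
```

For $k=1$ the prediction follows the single closest point, for intermediate $k$ the local red majority wins, and for $k$ equal to the size of the training set the prediction reduces to the globally most common class, regardless of where the new point lies.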

# 8. Exercise 3: k-NN in practice

In this case we are not only going to apply the algorithm, but we are also going to explore the influence of the hyperparameter $k$.

---

**TASK 1: Complete the missing steps**

**TASK 2: Find the function and the accuracy metric**

**TASK 3: Perform the fitting of the algorithm for various values for k.**

**TASK 4: Plot the accuracy as a function of the number of neighbors $k$**

In [None]:
reminder()

In [None]:
# add classes to keep and bands here

In [None]:
# process data here


In [None]:
X_train, X_test, y_train, y_test = train_test_split(  
                        stratify = ...,
                        test_size=0.3) 

print(f'- From {len(ml_objects)} sources:')
print(f'   {len(X_train)} (train)')
print(f'   {len(X_test)} (test)') 

In [None]:
from sklearn... import ...

from sklearn... import ...

# PERFORM CLASSIFICATION FOR VARIOUS VALUES OF k

# for each 'k', store the classifier and predictions on test sample
classifiers, predictions = [], []
accuracies_train, accuracies_test = [], []
kvals = np.arange(1, 15, 1)  # k values to be used, e.g. [1, 3, 10]

for k in kvals:
    
    KNN = ...
    KNN.fit(X_train, y_train)
    y_pred = KNN.predict(X_test)
    #print(y_pred)
    
    classifiers.append(KNN)
    predictions.append(y_pred)
    accuracies_test.append( ...( y_test, y_pred))
    accuracies_train.append( ...( y_train, KNN.predict(X_train)))

    print(f"Classification report:\n\n {metrics.classification_report(y_test, y_pred)}") 
    print(f"Confusion matrix: \n\n {metrics.confusion_matrix(y_test, y_pred)}")
    
    
    

In [None]:
## Plotting results

nn = [ clasf.n_neighbors for clasf in classifiers]

plt.plot(nn, ... , label='training')
plt.plot(nn, ... , label='test')

plt.xlabel('$k$')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

### Question: What do you notice? 
