# Homework 04

<font size=1>
    Nikolaos Vasilas & Elias Kyritsis, 2024. All rights reserved.
</font>  

<div class="alert alert-block alert-warning" style="margin-top: 20px">
       
**Exercise 1: Support Vector Machine Classification**
    <br><br>
    Following this week's lecture, let's use a Support Vector Machine to train a classifier on real astronomical data. In modern astrophysics, the continuous flow of data and the need to analyze it accurately have made machine learning algorithms very helpful for observational and theoretical astronomers.<br><br>

**Objective:** 
<br>
Stellar classification relies on a star's spectral data to sort stars into distinct categories. The current system, called the Morgan–Keenan (MK) classification, is based on the earlier Hertzsprung–Russell (HR) classification and uses chromaticity along with Roman numerals to indicate a star’s size. In this exercise, we’ll be using the Absolute and Apparent Magnitude, B-V Color Index, Parallax Distance, Surface Temperature, and Luminosity of stars to distinguish between Giants and Dwarfs.
    
    
> Keep in mind that:
    >
> - The Absolute magnitude can be calculated as:
$$
\begin{equation}
M = m + 5(\log_{10}(p) +1)  \tag{Eq. 1}
\end{equation}
$$
<br>
    > - The Surface Temperature in Kelvin is given by the formula:
    $$
\begin{equation}
T = \frac{4600}{0.92 \cdot (B - V) + 1.7} + \frac{4600}{0.92 \cdot (B - V) + 0.62}\tag{Eq. 2}
\end{equation}
$$
<br>
    > - The luminosity of a star:
 $$
\begin{equation}
\frac{L}{L_\odot} = 10^{0.4 (4.85 - M)}\tag{Eq. 3}
\end{equation}
$$
    > <br><br>
> Where $M$ is the Absolute Magnitude, $m$ is the Apparent Magnitude, $p$ is the parallax (the apparent angular shift of the star, which is inversely proportional to its distance from the Sun), $T$ is the Surface Temperature, $L$ is the Luminosity of the star, and $L_\odot$ is the luminosity of the Sun.
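
As a quick sanity check, Eqs. 1 to 3 can be written as small Python functions. This is only a sketch: the function names are made up for illustration, and it assumes that the parallax is plugged in exactly as it is listed in the `Plx` column. The input values are those of the first row of the dataset printout shown later in this notebook (Vmag = 5.99, Plx = 13.73, B-V = 1.318).

```python
import numpy as np


def absolute_magnitude(m, p):
    """Eq. 1: M = m + 5 (log10(p) + 1)."""
    return m + 5.0 * (np.log10(p) + 1.0)


def surface_temperature(b_v):
    """Eq. 2: surface temperature in Kelvin from the B-V color index."""
    return 4600.0 / (0.92 * b_v + 1.7) + 4600.0 / (0.92 * b_v + 0.62)


def luminosity_ratio(M):
    """Eq. 3: luminosity in units of the solar luminosity, L / L_sun."""
    return 10.0 ** (0.4 * (4.85 - M))


# First row of the dataset printout (assumed to be representative)
M = absolute_magnitude(5.99, 13.73)
T = surface_temperature(1.318)
ratio = luminosity_ratio(M)

print(f"Amag = {M:.4f}, Temp = {T:.2f} K, L/L_sun = {ratio:.3e}")
```

With these inputs, the first two functions reproduce the `Amag` and `Temp/Teff (K)` values listed for that row, which suggests the dataset columns were derived with the same formulas.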

The dataset contains 3180 rows and 7 columns (6 features plus the target label). The columns included in the dataset are:
     
    Vmag (Apparent Magnitude)
    Plx (Parallax Distance)
    B-V (B-V Color Index)
    Amag (Absolute Magnitude)
    Temp/Teff (K) (Surface Temperature)
    Luminosity (W) (Luminosity of stars)
    TargetClass (The target classification label)
<br>
    
**Tasks:**
<br>

- **Ex.1.1**: Load the data from the file ```HW04_data.csv```. 
    - Using the ```seaborn``` library, plot a `pairplot` of the features, color-coding the instances according to the class each one belongs to.
    - Plot a histogram of the target labels. Comment in one sentence on whether any model you train on these data is likely to run into problems.
    - Add a new column named "class_numeric", in which each class is replaced by a numerical value, e.g., 'Dwarf' $\leftrightarrow$ 0 and 'Giant' $\leftrightarrow$ 1.<br>
<br>
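
The label-encoding step of Ex.1.1 could look roughly like the sketch below. It uses a tiny hand-made frame (`toy`, a hypothetical stand-in for the real data) so that it runs without `HW04_data.csv`; only the column names `TargetClass` and `class_numeric` come from the assignment.

```python
import pandas as pd

# Toy stand-in for the real dataset (values invented for illustration only)
toy = pd.DataFrame({"TargetClass": ["Dwarf", "Giant", "Giant", "Dwarf"]})

# Map each class name to a numeric label: Dwarf -> 0, Giant -> 1
toy["class_numeric"] = toy["TargetClass"].map({"Dwarf": 0, "Giant": 1})

# Counting the instances per class is a quick numeric companion to the histogram
counts = toy["TargetClass"].value_counts()

print(toy)
print(counts)
```

On the real data, the same frame can then be passed to `sns.pairplot(df, hue="TargetClass")` for the color-coded pairplot.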
    
- **Ex.1.2**: Construct an SVM model for distinguishing between 'Dwarf' and 'Giant'. Use all the features of the dataset and the default SVM hyperparameters. For this example, use a simple train-test protocol (no validation, no CV):
    - Split data in _train_ and _test_
    - Fit on _train_
    - Assess performance on _test_ using the _accuracy_ metric
    - Print the accuracy of your model.<br>
<br>
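
The train-test protocol listed above can be sketched as follows. To keep the example self-contained it runs on synthetic data from `make_classification` instead of the stellar dataset, so the printed accuracy says nothing about the real problem; the 80/20 split and the seed are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data with 6 features (stand-in for the real dataset)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training set only, then reuse it for the test set
scaler = StandardScaler().fit(X_train)

# SVM with default hyperparameters, fitted on the scaled training set
model = SVC()
model.fit(scaler.transform(X_train), y_train)

# Assess the performance on the test set with the accuracy metric
acc = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
print(f"Test accuracy: {acc:.3f}")
```

Fitting the scaler on the training portion alone keeps information about the test set out of the preprocessing step.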
    
- **Ex.1.3**: A standard technique for optimizing hyperparameters is the Grid Search method. Grid Search CV is a cross-validation technique that finds the optimal hyperparameter values from a given grid of candidate values. Use ```sklearn.model_selection.GridSearchCV``` to tune the **_C_** hyperparameter **AND** the **kernel** hyperparameter (and any other hyperparameter you wish) of the SVM applied to your data.
    - For the **C** and **kernel** hyperparameters, try the values **C = [1,5,10]** and **kernel = ['linear', 'poly', 'rbf']**. Set the number of folds to 5 (i.e., `cv=5`). Print the best score and the best-fit parameters that GridSearchCV returns.
    - Using the optimal hyperparameter combination, evaluate the performance of your best model on a hold-out test set that you set aside at the beginning of **Ex.1.3** (separate from your master training set). Print the accuracy, the final classification report, and the confusion matrix of your best model.
    - What do you notice when comparing the accuracy scores of Ex.1.2 and Ex.1.3? Comment in one or two sentences.
<br><br>
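
One possible way to organise the grid search is sketched below, again on synthetic data rather than the stellar dataset. Wrapping the scaler and the SVM in a `Pipeline` is a design choice of this sketch, not something the exercise prescribes; it makes each CV fold refit the scaler on its own training part. The step names `scale` and `svc` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data (stand-in for the real dataset)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Hold out a test set before any tuning takes place
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scaling + SVM in one estimator; grid keys are prefixed with the step name
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": [1, 5, 10], "svc__kernel": ["linear", "poly", "rbf"]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best CV score:", search.best_score_)
print("Best parameters:", search.best_params_)

# With the default refit=True, predict() uses the best model refitted on all of X_train
y_pred = search.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Hold-out accuracy:", test_acc)
print(classification_report(y_test, y_pred, target_names=["Dwarf", "Giant"]))
print(cm)
```

The best CV score and the hold-out accuracy are computed on different data, so a small difference between them is expected.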

- **[Bonus] Ex 1.4**: <br>
    Run the same protocol as in **Ex.1.3**, but now train only on the features 'Amag', 'Temp/Teff (K)', and 'Luminosity (W)'. Compare the accuracy with that of **Ex.1.3**: did you expect this result? 
<br>
    
- **[Bonus] Ex 1.5**: <br>
   Plot the decision function of the best models of **Ex.1.3** and **Ex.1.4** in the (Surface Temperature, Luminosity) space.
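
A decision-function plot can be sketched as below. Several simplifications are assumed here: the data are random points with an invented class boundary, and the SVM is trained on only two features (log10 of temperature and log10 of luminosity) so that it can be evaluated directly on a 2D grid. A model trained on more features would need the remaining features fixed to some value before it could be drawn this way.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Invented data: uniform points in (log T, log L) with a made-up boundary
rng = np.random.default_rng(0)
n = 300
log_T = rng.uniform(3.5, 4.1, n)
log_L = rng.uniform(20.0, 24.0, n)
labels = (log_L + 2.0 * (log_T - 3.8) > 22.0).astype(int)

# Train a two-feature SVM on standardized log-features
features = np.column_stack([log_T, log_L])
scaler = StandardScaler().fit(features)
model = SVC(kernel="rbf", C=1).fit(scaler.transform(features), labels)

# Evaluate the decision function on a regular grid in log space
gx = np.linspace(3.5, 4.1, 150)
gy = np.linspace(20.0, 24.0, 150)
xx, yy = np.meshgrid(gx, gy)
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z = model.decision_function(scaler.transform(grid)).reshape(xx.shape)

# Draw in physical units with logarithmic axes
fig, ax = plt.subplots()
cf = ax.contourf(10**xx, 10**yy, Z, levels=20, cmap="coolwarm", alpha=0.6)
ax.contour(10**xx, 10**yy, Z, levels=[0], colors="k")  # decision boundary
ax.scatter(10**log_T, 10**log_L, c=labels, cmap="coolwarm", edgecolors="k", s=15)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Surface Temperature (K)")
ax.set_ylabel("Luminosity (W)")
fig.colorbar(cf, ax=ax, label="decision function")
```

The black contour marks where the decision function is zero, i.e. the boundary between the two predicted classes. Outside a notebook, call `plt.show()` to display the figure.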


**Hints:**

- The documentation of the SVM classifier is [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- The documentation of ```sklearn.model_selection.GridSearchCV``` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- Load the data with ```pandas.read_csv```. See documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) 
- For the third step of **Ex.1.1** see the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html)
- For **Ex 1.5** use:
    ```plt.xscale('log')``` and 
     ```plt.yscale('log')```
- Don't forget to normalize before running the SVM $-$ use the `StandardScaler` (documentation [here](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html))
- Don't forget to have fun!


In [51]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
import warnings
warnings.filterwarnings("ignore")

scale = StandardScaler()

df = pd.read_csv('C:/Users/souli/OneDrive/Desktop/ML/HW04_data.csv')

# Ex.1.1: encode the target labels numerically (Dwarf -> 0, Giant -> 1)
df['class_numeric'] = df['TargetClass'].map({'Dwarf': 0, 'Giant': 1})

# Feature matrix (2D) and target vector (1D)
columns = ['Vmag', 'Plx', 'B-V', 'Amag', 'Temp/Teff (K)', 'Luminosity (W)']
X = df[columns]
print(df)
y = np.array(df['class_numeric'])

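class Klasi():
    """Small helper that bundles the features and labels for splitting and plotting."""
    def __init__(self, X, y):
        self.X = X
        self.y = y
        # Copy of the features plus the numeric label, used only for plotting
        self.df = pd.DataFrame(X).copy()
        self.df['class_numeric'] = y
    def splitdata(self):
        # Simple 80/20 train-test split with a fixed seed for reproducibility
        X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, test_size=0.2, random_state=2)
        return X_train, X_test, y_train, y_test
    def plotdata(self):
        # Pairplot of the features, color-coded by class
        sns.pairplot(self.df, hue='class_numeric')
        plt.show()
        # Histogram of the target labels to check the class balance
        sns.histplot(data=self.df, x='class_numeric')
        plt.show()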
        
x = Klasi(X, y)
#x.plotdata()
X_train, X_test, y_train, y_test = x.splitdata()
# Fit the scaler on the training set only, then apply the same transform to the test set
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)

# Ex.1.2: SVM with the default hyperparameters
svc = svm.SVC()
svc.fit(X_train, y_train)
yhat = svc.predict(X_test)


k = accuracy_score(y_test, yhat, normalize=True)  # note the argument order: (y_true, y_pred)
print("Accuracy score:", k)

## Ex.1.3

# GridSearchCV: tune the C hyperparameter AND the kernel hyperparameter
# (and any other hyperparameter you wish) of the SVM applied to the data.

parameters = {'kernel': ('linear', 'poly', 'rbf'), 'C': [1, 5, 10]}

clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train, y_train)
print("Best score:", clf.best_score_)
print("Best parameters:", clf.best_params_)

# Second part of Ex.1.3: evaluate the best model on the held-out test data

best_model = svm.SVC(**clf.best_params_)  # use the tuned values instead of hard-coding them
best_model = best_model.fit(X_train, y_train)
yhattest = best_model.predict(X_test)
best_acc = accuracy_score(y_test, yhattest)
print("Accuracy score for the best model:", best_acc)

#Ex.1.4




      Vmag    Plx    B-V TargetClass       Amag  Temp/Teff (K)  \
0     5.99  13.73  1.318       Dwarf  16.678353    4089.516341   
1     8.70   2.31 -0.045       Dwarf  15.518060   10723.648049   
2     5.77   5.50  0.855       Dwarf  14.471813    5120.212718   
3     6.72   5.26 -0.015       Giant  15.324929   10316.282219   
4     8.76  13.44  0.584       Giant  19.401996    6030.905633   
...    ...    ...    ...         ...        ...            ...   
3175  7.79  12.92  0.772       Dwarf  18.346313    5366.546245   
3176  7.29   3.26  1.786       Dwarf  14.856088    3408.552355   
3177  8.29   6.38  0.408       Giant  17.314103    6837.926421   
3178  6.11   2.42  1.664       Dwarf  13.029077    3562.420235   
3179  8.81   1.87  1.176       Dwarf  15.169208    4356.363995   

      Luminosity (W)  class_numeric  
0       6.976392e+21              0  
1       2.031178e+22              0  
2       5.324104e+22              0  
3       2.426613e+22              1  
4       5.677712e