<h1>Preprocessing</h1>
The dataset contains the following columns:
<ul>
    <li>objid       = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS</li>
    <li>ra          = Right Ascension angle (at J2000 epoch)</li>
    <li>dec         = Declination angle (at J2000 epoch)</li>
    <li>u           = Ultraviolet filter in the photometric system</li>
    <li>g           = Green filter in the photometric system</li>
    <li>r           = Red filter in the photometric system</li>
    <li>i           = Near Infrared filter in the photometric system</li>
    <li>z           = Infrared filter in the photometric system</li>
    <li>run         = Run Number used to identify the specific scan</li>
    <li>rerun       = Rerun Number to specify how the image was processed</li>
    <li>camcol      = Camera column to identify the scanline within the run</li>
    <li>field       = Field number to identify each field</li>
    <li>specobjid   = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)</li>
    <li>redshift    = redshift value based on the increase in wavelength</li>
    <li>plate       = plate ID, identifies each plate in SDSS</li>
    <li>mjd         = Modified Julian Date, used to indicate when a given piece of SDSS data was taken</li>
    <li>fiberid     = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation</li>
    <li>class       = object class (galaxy, star or quasar object)</li>
</ul>

In [10]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_regression, f_regression, SelectKBest
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.metrics import accuracy_score
import pickle
from sklearn.metrics import confusion_matrix

Let's start by loading the dataset into a DataFrame.

In [11]:
# dfStars = pd.read_csv('Skyserver_12_15_2020 3 45 07 AM.csv', na_values="?")
dfStars = pd.read_csv('FileCSV/star_classification.csv', na_values="?")
dfStars

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1.237661e+18,135.689107,32.494632,23.87882,22.27530,20.39501,19.16573,18.79371,3606,301,2,79,6.543777e+18,GALAXY,0.634794,5812,56354,171
1,1.237665e+18,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,4518,301,5,119,1.176014e+19,GALAXY,0.779136,10445,58158,427
2,1.237661e+18,142.188790,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,3606,301,2,120,5.152200e+18,GALAXY,0.644195,4576,55592,299
3,1.237663e+18,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.25010,4192,301,3,214,1.030107e+19,GALAXY,0.932346,9149,58039,775
4,1.237680e+18,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,8102,301,3,137,6.891865e+18,GALAXY,0.116123,6121,56187,842
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,1.237679e+18,39.620709,-2.594074,22.16759,22.97586,21.90404,21.30548,20.73569,7778,301,2,581,1.055431e+19,GALAXY,0.000000,9374,57749,438
99996,1.237679e+18,29.493819,19.798874,22.69118,22.38628,20.45003,19.75759,19.41526,7917,301,1,289,8.586351e+18,GALAXY,0.404895,7626,56934,866
99997,1.237668e+18,224.587407,15.700707,21.16916,19.26997,18.20428,17.69034,17.35221,5314,301,4,308,3.112008e+18,GALAXY,0.143366,2764,54535,74
99998,1.237661e+18,212.268621,46.660365,25.35039,21.63757,19.91386,19.07254,18.62482,3650,301,4,131,7.601080e+18,GALAXY,0.455040,6751,56368,470


In [None]:
dfStars.head().T

Here I check the distribution of the class labels.

In [None]:
dfStars['class'].value_counts()

In [None]:
fig,axes = plt.subplots(1, 1, figsize=(5,5), sharey=True)
sns.countplot(x='class',data = dfStars, ax = axes, order = dfStars['class'].value_counts().index)
plt.show()

Check some information about the features of the dataset.

In [None]:
dfStars.describe()

In [None]:
dfStars.info()

Check for NaN or null values.

In [None]:
dfStars.isna().sum(axis=0)

Check whether there are duplicate rows.

In [None]:
dfStars[dfStars.duplicated(keep=False)]

<h1>Encoding of the class label</h1>



In [12]:
# Encode the class label as integers.
# Work on a copy so that the original dfStars is left untouched.
dfStarsNormalize = dfStars.copy()
dfStarsNormalize['class'] = dfStarsNormalize['class'].map({'GALAXY': 0, 'STAR': 1, 'QSO': 2})
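
The integer codes are convenient for the classifiers, but predictions are easier to read as class names. Below is a minimal sketch of the round trip on a small hypothetical Series (`labels`), not the real dataset. The names `class_to_code` and `code_to_class` are introduced here only for illustration.

```python
import pandas as pd

# Same mapping as in the cell above
class_to_code = {'GALAXY': 0, 'STAR': 1, 'QSO': 2}
# Inverse mapping, to turn integer predictions back into readable names
code_to_class = {code: name for name, code in class_to_code.items()}

# Hypothetical labels, only to illustrate the round trip
labels = pd.Series(['GALAXY', 'QSO', 'STAR', 'GALAXY'])
encoded = labels.map(class_to_code)
decoded = encoded.map(code_to_class)

print(encoded.tolist())
print(decoded.tolist())
```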

<h1>Split the dataset into training and test sets</h1>

In [13]:
# Split the dataset into features and class label
X = dfStarsNormalize.drop('class', axis=1)
y = dfStarsNormalize['class']
X.shape, y.shape

((100000, 17), (100000,))

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=0)
print('Shape of the train and test set')
print(X_train.shape) # type: ignore
print(X_test.shape) # type: ignore
print()
print('Distribution of the labeled-classes')  
print()
print("Training Set")
print(pd.Series(y_train).value_counts())
print()
print("Test Set")
print(pd.Series(y_test).value_counts())

Shape of the train and test set
(70000, 17)
(30000, 17)

Distribution of the labeled-classes

Training Set
0    41636
1    15044
2    13320
Name: class, dtype: int64

Test Set
0    17809
1     6550
2     5641
Name: class, dtype: int64
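
The split above is purely random, so the class proportions in the two sets are only approximately equal. If exact proportions matter, `train_test_split` accepts a `stratify` argument. Here is a small sketch on hypothetical toy data (`X_toy` and `y_toy` are illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% class 0, 20% class 1, 20% class 2
y_toy = pd.Series([0] * 60 + [1] * 20 + [2] * 20)
X_toy = pd.DataFrame({'feature': np.arange(len(y_toy))})

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```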


<h1>Feature Selection</h1>

In [15]:
X_train.drop(['objid','run','rerun','camcol','field','fiberid'], axis = 1, inplace=True) # type: ignore

<h5>USING: mutual_info_regression</h5>

In [None]:
# Compute the mutual information score for each feature
# (do not sort the array: position i must stay aligned with column i of X_train)
mi = mutual_info_regression(X_train, y_train)

<h5>USING: f_regression</h5>

In [None]:
# Compute the univariate F-test score for each feature using f_regression
scores, pvalues = f_regression(X_train, y_train)

The following DataFrame shows the score achieved by each attribute:

In [None]:
# One row per scorer, one column per feature (same order as the columns of X_train)
dfScores = pd.DataFrame([scores, mi],
                        columns=X_train.columns, # type: ignore
                        index=['f_regression', 'mutual_info_regression'])
dfScores.T
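
Because `class` is a categorical target, scikit-learn also provides classification counterparts of these two scorers: `f_classif` (the ANOVA F-test) and `mutual_info_classif`. The sketch below shows how they could be used. The data comes from `make_classification` and the column names `f0` to `f4` are placeholders, not the SDSS features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic 3-class problem standing in for the real training set
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     n_classes=3, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f'f{i}' for i in range(5)])

f_scores, p_values = f_classif(X_demo, y_demo)
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

# Indexing by feature name keeps every score attached to its column
df_demo_scores = pd.DataFrame({'f_classif': f_scores,
                               'mutual_info_classif': mi_scores},
                              index=X_demo.columns)
print(df_demo_scores.sort_values('mutual_info_classif', ascending=False))
```

Sorting the DataFrame rather than the raw score array keeps the feature names aligned with their scores.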

In [None]:
# Create a SelectKBest object with the desired value of k (for example 10)
selector = SelectKBest(score_func=f_regression, k=10)

# Fit the selector to the dataset and select the features
X_selected = selector.fit_transform(X_train, y_train)

# Print the selected features
print("Selected features:")
print(X_train.columns[selector.get_support()]) # type: ignore

<h3>Delete Features</h3>
Now we drop the features that carry little information.

In [None]:
X_train.keys() # type: ignore

Drop all the features with a score < 0.1.

In [16]:
X_train.drop(['ra', 'dec', 'u'], axis = 1, inplace=True) # type: ignore
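
The list of columns to drop can also be derived from the scores instead of being typed by hand. This sketch assumes the scores are held in a pandas Series indexed by feature name. The values below are made up for illustration and are not the results computed in this notebook.

```python
import pandas as pd

# Hypothetical scores, NOT the values computed in this notebook
demo_scores = pd.Series({'ra': 0.01, 'dec': 0.03, 'g': 0.25, 'redshift': 0.90})
threshold = 0.1

# Names of the features whose score falls below the threshold
to_drop = demo_scores[demo_scores < threshold].index.tolist()

# Toy frame with the same column names, to show the drop
X_demo = pd.DataFrame({'ra': [1.0], 'dec': [2.0], 'g': [3.0], 'redshift': [4.0]})
X_reduced = X_demo.drop(columns=to_drop)

print(to_drop)
print(X_reduced.columns.tolist())
```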

<h1>Rebalance Dataset</h1>
I rebalance the training set by oversampling.<br>
I add synthetic examples to the two minority classes (STAR, QSO) until they reach the size of the class with the highest number of examples (GALAXY).<br>
The following snippet implements the oversampling with SMOTE; an alternative based on resample() is kept, commented out, in the cell after it.

In [17]:
# using oversampling with SMOTE to deal with imbalanced data
sm = SMOTE(random_state=42)
X_train_after_balancing, y_train_after_balancing = sm.fit_resample(X_train, y_train) # type: ignore

print(X_train_after_balancing.shape) # type: ignore
y_train_after_balancing.value_counts() # type: ignore

(124908, 8)


0    41636
1    41636
2    41636
Name: class, dtype: int64
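
SMOTE does not duplicate rows: each synthetic example is placed on the segment between a minority sample and one of its nearest minority neighbours. Below is a simplified NumPy sketch of that interpolation step. `smote_like` is a hypothetical helper written for this note and is not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between the minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a starting sample and one of its k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Random position on the segment between the two points
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: the four corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_synthetic = smote_like(X_minority, n_new=6, k=2)
print(X_synthetic.shape)
```

Since every synthetic point is a convex combination of two real samples, the new points stay inside the region already covered by the minority class.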

In [None]:
# from sklearn.utils import resample
# 
# # I need to use a Dataframe with the labeled-class
# df_balanced = X_train
# df_balanced['class'] = y_train # type: ignore
# 
# # Split the dataset in 3 part, one for each class
# df_class_0 = df_balanced[df_balanced["class"] == 0] # type: ignore
# df_class_1 = df_balanced[df_balanced["class"] == 1] # type: ignore
# df_class_2 = df_balanced[df_balanced["class"] == 2] # type: ignore
# 
# # Find the minority class
# min_class = df_balanced["class"].value_counts().idxmin() # type: ignore
# 
# # We oversample the least numerous classes
# df_class_1_over = resample(df_class_1,
#                             replace=True, # Sample with replacement
#                             n_samples=len(df_class_0), # Match number in majority class
#                             random_state=42) # reproducible results
# df_class_2_over = resample(df_class_2,
#                             replace=True, # Sample with replacement
#                             n_samples=len(df_class_0), # Match number in majority class
#                             random_state=42) # reproducible results
# 
# # Join the three dataframes
# df_balanced = pd.concat([df_class_0, df_class_1_over, df_class_2_over]) # type: ignore
# 
# # Mix the data
# df_balanced = df_balanced.sample(frac=1, random_state=42)
#
# df_balanced['class'].value_counts() # type: ignore
# print(df_balanced.shape) # type: ignore
# Split the class label from the other attributes of df_balanced again
# X_train_after_balancing = df_balanced.drop('class', axis=1) # type: ignore
# y_train_after_balancing = df_balanced['class'] # type: ignore

<h1 style="font-weight: bold">CLASSIFICATION</h1>

<h2>KNeighborsClassifier</h2>
I ran KNN with different values of k to understand which one works best. <br>
I used both the Euclidean and the Manhattan distance to compare them.

In [None]:
X_test_drop = X_test.drop(['ra', 'dec', 'u', 'objid','run','rerun','camcol','field','fiberid'], axis = 1) # type: ignore

k_neighbors = 100
metrics = ['euclidean', 'manhattan']

accuracy_total = []
for k in range(1, k_neighbors+1, 1):
    accuracy_k = []
    for metric in metrics:
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        knn.fit(X_train_after_balancing, y_train_after_balancing)
        y_predKNN = knn.predict(X_test_drop)
        accuracy = accuracy_score(y_test, y_predKNN)
        accuracy_k.append(accuracy)
        # print(f"For metric = {metric} and k = {k}:      ACCURACY = {accuracy}")
    accuracy_total.append(accuracy_k)

accuracy_df = pd.DataFrame(np.array(accuracy_total), columns=metrics)
k_df = pd.DataFrame({'k': range(1, k_neighbors + 1)})
accuracy_join= k_df.join(accuracy_df)

The following snippet plots the KNN accuracy as a function of k for the Euclidean and Manhattan distances.<br>
The curves for the two distances are almost identical, but the accuracy peaks at only 67%, for k = 3.

In [None]:
plt.plot(accuracy_join['k'], accuracy_join['euclidean'], label='euclidean')
plt.plot(accuracy_join['k'], accuracy_join['manhattan'], label='manhattan')

plt.legend()
plt.xlabel('k')
plt.ylabel('accuracy score')
plt.show()

In [None]:
# accuracy_total[i] holds the [euclidean, manhattan] scores for k = i + 1
accuracy_ed = [acc[0] for acc in accuracy_total]
accuracy_md = [acc[1] for acc in accuracy_total]

best_ed = int(np.argmax(accuracy_ed))
print(f"Euclidean Distance: The value of k with the highest accuracy is {best_ed + 1}. Accuracy = {accuracy_ed[best_ed]}")

best_md = int(np.argmax(accuracy_md))
print(f"Manhattan Distance: The value of k with the highest accuracy is {best_md + 1}. Accuracy = {accuracy_md[best_md]}")

# Keep the best Manhattan score as the KNN accuracy for the final comparison
accuracyKNN = accuracy_md[best_md]
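
KNN is distance-based, so columns with a large numeric range (such as specobjid, plate or mjd here) can dominate the distance. Standardizing the features and choosing k by cross-validation on the training data, rather than on the test set, is a common alternative. Whether this improves the accuracy on this dataset is an assumption that has not been verified here. The sketch below runs on synthetic data from `make_classification`, not on the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the notebook's training and test sets
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# Scaling is fitted inside each fold, so no information leaks from validation data
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
param_grid = {'knn__n_neighbors': [1, 3, 5, 7, 9],
              'knn__metric': ['euclidean', 'manhattan']}

# k and the metric are chosen on the training folds only
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```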

<h2>Decision Tree</h2>

In [None]:
X_test_drop = X_test.drop(['ra', 'dec', 'u', 'objid','run','rerun','camcol','field','fiberid'], axis = 1) # type: ignore

# Create the model
clf = DecisionTreeClassifier()

# Fit the model
clf = clf.fit(X_train_after_balancing, y_train_after_balancing)
y_predDT = clf.predict(X_test_drop)

# Check the accuracy
accuracyDT = accuracy_score(y_test, y_predDT)
print(f"Accuracy: {accuracyDT:.2f}")

# Check the performance
scores = cross_val_score(clf, X_train_after_balancing, y_train_after_balancing, cv=5)
print(f"Performances: {scores}")
print(f"Average performance: {scores.mean():.2f}")



<h2>Random Forest</h2>

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_test_drop = X_test.drop(['ra', 'dec', 'u', 'objid','run','rerun','camcol','field','fiberid'], axis = 1) # type: ignore

# Create the model
clf = RandomForestClassifier()

# Fit the model
clf.fit(X_train_after_balancing, y_train_after_balancing)

# Check the accuracy
y_predRF = clf.predict(X_test_drop)
accuracyRF = accuracy_score(y_test, y_predRF)
print(f"Accuracy: {accuracyRF:.2f}")

Accuracy: 0.98
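
A fitted random forest also exposes `feature_importances_`, which can be compared with the univariate scores from the feature-selection step. The sketch below uses synthetic data, with placeholder feature names `f0` to `f4` and a stand-in model named `forest`; it does not report the importances of the notebook's model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class problem standing in for the balanced training set
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     n_classes=3, random_state=0)
feature_names = [f'f{i}' for i in range(5)]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across the features
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```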


In [None]:
from sklearn.model_selection import cross_val_score, learning_curve
import matplotlib.pyplot as plt

# Use cross-validation to evaluate the accuracy of the model
scores = cross_val_score(clf, X_train_after_balancing, y_train_after_balancing, cv=10)

# Print the cross-validation results
print("Cross-validation scores:", scores)
print("Mean score:", scores.mean())

# Draw the learning curve of the model
train_sizes, train_scores, test_scores = learning_curve(clf, X_train_after_balancing, y_train_after_balancing, cv=10)

plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', color='r', label="Training score")
plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', color='g', label="Cross-validation score")

plt.title("Learning Curves (Random Forest)")
plt.xlabel("Training examples")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()

<h2>Bayesian Classifier</h2>

In [None]:
from sklearn.naive_bayes import GaussianNB
from scipy.stats import pearsonr

X_test_drop = X_test.drop(['ra', 'dec', 'u', 'objid','run','rerun','camcol','field','fiberid'], axis = 1) # type: ignore

# Create the model
gnb = GaussianNB()

# Fit the model
gnb.fit(X_train_after_balancing, y_train_after_balancing)

# Check the accuracy
y_predBC = gnb.predict(X_test_drop)
accuracyBC = accuracy_score(y_test, y_predBC)
print(f"Accuracy: {accuracyBC:.2f}")

<h3>ACCURACY</h3>
Now we compare the accuracy of all the classifiers tried.<br>
Random Forest turns out to be the best algorithm.


In [None]:
print(f"Accuracy of Random Forest: {accuracyRF:.2f}")
print(f"Accuracy of Decision Tree: {accuracyDT:.2f}")
print(f"Accuracy of KNN: {accuracyKNN:.2f}")
print(f"Accuracy of Bayesian Classifier: {accuracyBC:.2f}")

<h1 style="font-weight: bold">Save The Model</h1>

<p style="size: 12pt">I save the Random Forest model to a file so that it can be used in an external application.</p>

In [None]:
with open('starClassificatinApp/model.pkl', 'wb') as f:
    pickle.dump(clf, f)
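
In the external application the model can be loaded back with `pickle.load`. The input columns must match the ones used for training, in the same order. Below is a self-contained sketch with a small stand-in model and a temporary file. The data, the `model` variable and the temporary path are illustrative only.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model; in practice the notebook's clf would be saved
rng = np.random.default_rng(0)
X_demo = rng.random((60, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    # Only unpickle files that come from a trusted source
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# The reloaded model should reproduce the original predictions
same_predictions = bool((loaded.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

A pickled model should generally be loaded with the same scikit-learn version that was used to save it.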

In [None]:
X_train_after_balancing.keys()

<h1>Evaluation of the model</h1>

In [21]:
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix (rows: true class, columns: predicted class).
# The result is stored in cm so that the imported function is not overwritten.
cm = confusion_matrix(y_test, y_predRF)
print(cm)

[[17408    76   325]
 [    4  6546     0]
 [  320     1  5320]]
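
Per-class precision and recall can be read directly from this matrix. The sketch below copies the printed values into a NumPy array. The class order GALAXY, STAR, QSO follows the encoding 0, 1, 2 used earlier.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class)
cm = np.array([[17408, 76, 325],
               [4, 6546, 0],
               [320, 1, 5320]])
class_names = ['GALAXY', 'STAR', 'QSO']

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true members of the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = np.trace(cm) / cm.sum()

for name, p, r in zip(class_names, precision, recall):
    print(f"{name}: precision = {p:.3f}, recall = {r:.3f}")
print(f"Accuracy = {accuracy:.4f}")
```

Most of the errors are confusions between GALAXY and QSO (325 and 320 cases), while STAR is almost never misclassified.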
