In [1]:
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns


# Classifying Galaxies in Galaxy Zoo

The aim of this project is to train a model that classifies galaxies based on their distinct shapes, following the **Galaxy Zoo Challenge** on Kaggle. 
We work with a dataset of 61,578 images and their corresponding labels: 37 values per image that record how users answered the questions they were asked about the galaxy images they were looking at.

<div align="center">
  <h1>Questions Tree</h1>
  <img title="Questions Tree" src="Images/galaxy_tree.png" alt="Questions Tree">
  <p>Retrieved from <a href="https://arxiv.org/abs/1308.3496">https://arxiv.org/abs/1308.3496</a></p>
</div>


## Exploratory Data Analysis

In the graph below we show the confidence level of each answer reported in the dataset. Notice that each column of the dataset is named "ClassA.B", where A is the question and B is one of its possible answers.

<div align="center">
  <img title="Confidence level" src="Images/confidence.png">
</div>
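
The "ClassA.B" naming convention can be unpacked programmatically to see which answers belong to which question. The snippet below is a minimal sketch: the helper `group_answers_by_question` and the short column list are hypothetical and only illustrate the convention; they are not part of the original notebook.

```python
from collections import defaultdict

def group_answers_by_question(columns):
    """Map each question number to the list of its answer numbers."""
    groups = defaultdict(list)
    for name in columns:
        if not name.startswith("Class"):
            continue  # skip non-label columns such as GalaxyID
        question, answer = name[len("Class"):].split(".")
        groups[int(question)].append(int(answer))
    return dict(groups)

# Hypothetical subset of the label columns, for illustration only
example_columns = ["GalaxyID", "Class1.1", "Class1.2", "Class1.3", "Class6.1", "Class6.2"]
question_groups = group_answers_by_question(example_columns)
print(question_groups)
```
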
Notice that the two classes with the most high-confidence responses are 1 and 6. Based on this observation, we decided to reduce our features and selected classes 1.1, 1.2 and 6.1. 

* **Class1.1**: The galaxy is simply smooth and rounded.
* **Class1.2**: The galaxy shows signs of a disk.
* **Class6.1**: The galaxy is odd.

Therefore, we chose these three features to classify the shapes of galaxies, reducing the problem to three categories: **elliptical**, **spiral** or **odd**.

First, we removed the objects that more than 50% of respondents agreed do not represent a galaxy at all. Then, for spirals and ellipticals, we selected the objects that more than 70% of respondents placed in the respective category and that fewer than 10% considered odd.
Finally, we labelled as odd the objects that more than 63% of respondents considered unusual.

We further reduced the dataset by sampling 5000 objects from each of the three selected classes, to keep the computation manageable and to start from a balanced sample, ending up with 15000 rows of sample data.

In [2]:
labels = pd.read_csv("Galaxy_data/training_solutions_rev1.csv")
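# Drop objects that more than 50% of respondents flagged as not being a galaxy
labels = labels.drop(labels[labels['Class1.3'] > 0.5].index)
# .copy() guards against pandas' SettingWithCopyWarning when 'Result' is assigned below
labels = labels[['GalaxyID', 'Class1.1', 'Class1.2', 'Class6.1']].copy()
# Default label for objects that pass none of the cuts below
labels['Result'] = 'i'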
labels.loc[(labels['Class1.1'] > 0.7 ) & (labels['Class1.2'] < 0.3 ) & (labels['Class6.1'] < 0.1 ), 'Result'] = 'e'
labels.loc[(labels['Class1.1'] < 0.3 ) & (labels['Class1.2'] > 0.7 ) & (labels['Class6.1'] < 0.1 ) , 'Result'] = 's'
labels.loc[ (labels['Class6.1'] > 0.63 ), 'Result'] = 'o'

elip_df = labels[labels['Result'] == 'e']
spiral_df = labels[labels['Result'] == 's']
odd_df = labels[labels['Result'] == 'o']

# Sample 5000 values from each category

category1_sampled = elip_df.sample(n=5000, random_state=42)
category2_sampled = spiral_df.sample(n=5000, random_state=42)
category3_sampled = odd_df.sample(n=5000, random_state=42)

# Concatenate the sampled DataFrames back together

sampled_df = pd.concat([category1_sampled, category2_sampled, category3_sampled])

galaxy =  sampled_df.sort_values(by='GalaxyID')
galaxy_ids = galaxy['GalaxyID'].to_numpy()
galaxy.head()

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class6.1,Result
2,100053,0.765717,0.177352,0.0,e
6,100123,0.462492,0.456033,0.687647,o
16,100263,0.179654,0.81853,0.913055,o
19,100322,0.091987,0.908013,0.0,s
30,100458,0.820908,0.081499,0.921161,o


### Correlation matrix

The correlation matrix shows a strong negative correlation between ellipticals and spirals (this was expected, since the way the dataset was constructed makes one exclude the other), while the correlations with odd galaxies are weaker and have opposite signs (negative with ellipticals and positive with spirals).

<div align="center">
  <h1>Correlation Matrix</h1>
  <img title="Corr" src="Images/Corr_table.png">
</div>
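
As a rough illustration of how such a matrix can be obtained, the sketch below computes Pearson correlations with `DataFrame.corr()`. It assumes the matrix is computed on the three vote-fraction columns; the values in `votes` are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy vote fractions (made-up values, not taken from the dataset)
votes = pd.DataFrame({
    "Class1.1": [0.90, 0.10, 0.50, 0.80, 0.20],
    "Class1.2": [0.05, 0.85, 0.45, 0.15, 0.75],
    "Class6.1": [0.00, 0.10, 0.70, 0.05, 0.30],
})

# Pairwise Pearson correlation between the columns (pandas' default method)
corr = votes.corr()
print(corr.round(2))
```

With real data, `votes` would be replaced by the corresponding columns of the labels table.
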

## Image Preprocessing

In the resulting dataset, we noticed that the galaxy is almost perfectly centered in every image and that a good fraction of the pixels around the borders is almost black and does not provide any substantial information. Therefore, we decided to crop our samples following this strategy:

* We work with grayscale images to reduce the memory required to store the whole dataset. Each image is stored as an array in which each element represents the luminosity of a single pixel, ranging between 0 (black) and 255 (white).

* We calculate a **mean image** by averaging the corresponding pixels across all samples.

<div align="center">
  <h1>Mean Image</h1>
  <img title="Mean" src="mean.jpg">
</div>

* We compute and plot the correlation matrix on the mean image to decide which part of the image should be cropped.

<div align="center">
  <h1>Correlation Matrix of the Mean Image</h1>
  <img title="Correlation matrix" src="Images/corr_matrix_mean.png">
</div>



From the resulting matrix, we decided to extract the central 256x256 pixels of each image.

<div align="center">
  <h1>Original images</h1>
  <img title="original images" src="Images/original_galaxy.png">
</div>




<div align="center">
  <h1>Cropped images</h1>
  <img title="cropped images" src="Images/cropped_galaxies.png">
</div>
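
The preprocessing steps above (grayscale conversion, mean image, central crop) can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helpers `to_grayscale` and `center_crop` are hypothetical, the luminance weights are the common ITU-R BT.601 ones, the 424x424 input size is assumed, and the batch is random noise standing in for real images.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB array to an (H, W) luminance array (BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights

def center_crop(image, size):
    """Return the central size x size window of a 2-D image."""
    h, w = image.shape
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

rng = np.random.default_rng(0)
# Stand-in for a batch of 424 x 424 RGB images (random pixels, not real galaxies)
batch = rng.integers(0, 256, size=(4, 424, 424, 3)).astype(np.float64)

gray = np.stack([to_grayscale(img) for img in batch])         # shape (4, 424, 424)
mean_image = gray.mean(axis=0)                                # pixel-wise average
cropped = np.stack([center_crop(img, 256) for img in gray])   # shape (4, 256, 256)
print(gray.shape, mean_image.shape, cropped.shape)
```

Cropping from 424x424 to 256x256 under these assumptions keeps roughly 36% of the pixels, which is where most of the memory saving comes from.
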



## Principal Component Analysis

We perform PCA with the scikit-learn module to reduce the number of features in our dataset. We set the *n_components* parameter to 0.8, so that the retained components explain 80% of the variance. After this operation, the number of features is reduced to **20**. This fraction of the variance gave the best accuracy without massively increasing the number of features (with 70% we get 9 features and 5% lower accuracy; with 90% we get 78 features and only 0.5% higher accuracy).  
In the graph below we report the eigenvalues of the PCA as a function of the number of components.


<div align="center">
  <img title="pca_exp" src="Images/pca_exp_ratio.png">
</div>
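
For reference, passing a float between 0 and 1 as `n_components` makes scikit-learn keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows this on synthetic data; the matrix `X` is a made-up stand-in for the flattened images, so the resulting number of components has no relation to the 20 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for flattened images: 200 samples, 50 features,
# with most of the variance concentrated in a few latent directions
latent = rng.normal(size=(200, 5)) * np.array([10.0, 6.0, 4.0, 2.0, 1.0])
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + rng.normal(scale=0.5, size=(200, 50))

# A float n_components selects components by explained variance
pca = PCA(n_components=0.8, svd_solver="full")
X_reduced = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape[1], "components explain", round(float(explained), 3), "of the variance")
```

Repeating the fit with 0.7 and 0.9 is a simple way to reproduce the kind of trade-off between feature count and accuracy discussed above.
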

The plot below shows three components of the PCA. Elliptical and spiral galaxies predominantly occupy distinct regions of the space, while galaxies classified as "odd" show a more dispersed distribution that overlaps the other two categories evenly. This behaviour can be attributed to the distinctive characteristics of ellipticals and spirals, whereas odd galaxies still share elements with both.

In [3]:
from IPython.display import IFrame
IFrame(src="Images/grafico_3d_interattivo.html", width='100%', height=800)

# Entropy and Symmetry features

After the PCA, we computed an entropy feature and symmetry features along different axes (vertical and diagonal) for each galaxy, in order to reduce the number of false positives and false negatives and to help the model distinguish odd galaxies from ellipticals and spirals.
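
The notebook does not show how these features were computed, so the following is only one plausible sketch. It assumes the entropy is the Shannon entropy of the pixel-intensity histogram and that symmetry is measured as the mean absolute difference between an image and its mirrored or transposed copy; the three helper functions and the synthetic images are hypothetical, and the values they produce are not on the same scale as the ones stored in the processed data.

```python
import numpy as np

def shannon_entropy(image, bins=256):
    """Shannon entropy (in bits) of the pixel-intensity histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def vertical_asymmetry(image):
    """Mean absolute difference between the image and its left-right mirror."""
    return float(np.abs(image - np.fliplr(image)).mean())

def diagonal_asymmetry(image):
    """Mean absolute difference between a square image and its transpose."""
    return float(np.abs(image - image.T).mean())

# A synthetic, perfectly symmetric blob: a centred 2-D Gaussian on a 64 x 64 grid
y, x = np.mgrid[-32:32, -32:32] + 0.5
blob = 255.0 * np.exp(-(x**2 + y**2) / (2 * 10.0**2))

# The same blob with a bright patch on one side only
lopsided = blob.copy()
lopsided[10:20, 40:55] += 80.0

for name, img in [("symmetric", blob), ("lopsided", lopsided)]:
    print(name, shannon_entropy(img), vertical_asymmetry(img), diagonal_asymmetry(img))
```

With these definitions, a value of zero means perfect symmetry, and larger values indicate a more irregular object.
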


In [7]:
pca = np.load("Proccessed_data/pca_fast.npy", allow_pickle=True)
df = pd.DataFrame(pca)
# Name the principal-component columns Comp_1 ... Comp_n
n_components = len(df.columns)
df.columns = [f'Comp_{i}' for i in range(1, n_components + 1)]
# Target label; assumed to be in the same (GalaxyID-sorted) order as the PCA rows
df.insert(0, 'Shape', galaxy['Result'].values.astype(str))
entropy = np.load("Proccessed_data/entr.npy", allow_pickle=True)
df.insert(1, 'Entropy', entropy)
# NOTE: the two symmetry columns are filled from entr.npy as well, so they
# duplicate 'Entropy' (see the output below). The symmetry arrays should be
# loaded here instead; their file names are not given in this notebook.
df.insert(2, 'Vert_Symm', entropy)
df.insert(3, 'Diag_Symm', entropy)
df.head()

Unnamed: 0,Shape,Entropy,Vert_Symm,Diag_Symm,Comp_1,Comp_2,Comp_3,Comp_4,Comp_5,Comp_6,...,Comp_11,Comp_12,Comp_13,Comp_14,Comp_15,Comp_16,Comp_17,Comp_18,Comp_19,Comp_20
0,e,0.025132,0.025132,0.025132,4346.050167,-1456.188393,-1896.784438,-676.409366,36.773572,195.571665,...,-227.055876,25.33482,33.682129,57.169679,-255.691422,-112.776652,-293.350937,392.706494,-121.912697,-102.918707
1,o,0.023503,0.023503,0.023503,889.894082,90.320612,1142.557293,268.943849,1152.781022,-798.075256,...,102.786704,88.255408,82.456882,-471.795176,148.323664,-1485.647633,-418.415224,-468.584332,383.944222,7.113392
2,o,0.023948,0.023948,0.023948,1568.656494,-1966.178868,-1025.526464,686.554955,-964.185359,3750.701429,...,1403.017487,439.064731,1609.516629,-148.428221,297.017676,-579.905823,576.214745,193.280893,28.143759,-5.141216
3,s,0.017978,0.017978,0.017978,-3270.449328,-1755.345616,2012.692208,245.920395,170.642006,69.427892,...,-403.672541,-83.69011,0.764526,-164.029216,-334.458436,-151.886122,300.297738,667.889421,-177.189337,-421.773141
4,o,0.023738,0.023738,0.023738,251.866158,992.014197,-1065.144293,1611.747247,-1334.036183,-2011.712083,...,770.157136,206.304658,-1367.498712,-128.088789,404.980605,-1425.601187,-71.209145,-408.029391,-430.674039,277.177246


## Build the classifier using Random Forest

We tune the hyperparameters of a Random Forest classifier with scikit-learn. The *Shape* column is the target to be predicted. Using **train_test_split**, we partition the dataset into training and test sets, with 80% for training and 20% for testing.
We then use **GridSearchCV** to identify the optimal parameters for the model. 

These are the hyperparameters we used in the grid:


* **n_estimators**: The number of trees in the forest. It takes values from the list (50, 100, 150, 200, 300). 

* **max_depth**: The maximum depth of the trees. It takes values from the list (None, 10, 20). A deeper tree can capture more complex relationships but may lead to overfitting.

* **min_samples_split**: The minimum number of samples required to split an internal node. It takes values from the list (2, 5, 10). 

* **min_samples_leaf**: The minimum number of samples required to be at a leaf node. It takes values from the list (1, 2, 4). 

* **max_features**: The number of features to consider when looking for the best split. It takes values from the list ('auto', 'sqrt', 'log2'). 

These are the best parameters found by GridSearchCV:

* max_depth = **20**
* max_features = **log2**
* min_samples_leaf = **1**
* min_samples_split = **2**
* n_estimators = **300**
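
The tuning procedure described in this section can be sketched as follows. This is a minimal example under stated assumptions: the data come from `make_classification` rather than from the galaxy features, and the grid is deliberately much smaller than the one listed above so that it runs in a few seconds. The `'auto'` option for `max_features` is left out because newer scikit-learn releases no longer accept it for `RandomForestClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the PCA + entropy/symmetry features
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)

# 80% training, 20% test, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Deliberately small grid so the sketch runs quickly
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [None, 10],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```

Note that the cost grows multiplicatively with the grid: the full grid listed above has 5 x 3 x 3 x 3 x 3 = 405 combinations, each fitted once per cross-validation fold.
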
  

# Accuracy of the Classifier

In this section we report the classification results of our analysis. The best accuracy achieved is ~**82%**. As expected, the odd shape is the most difficult for our model to identify.

In [8]:
with open('classification_results.txt', 'r') as file:
    saved_results = file.read()

print(saved_results)

Accuracy: 0.82

Classification Report:
              precision    recall  f1-score   support

           e       0.83      0.84      0.83      1014
           o       0.79      0.80      0.79      1001
           s       0.82      0.80      0.81       985

    accuracy                           0.81      3000
   macro avg       0.81      0.81      0.81      3000
weighted avg       0.81      0.81      0.81      3000
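
A report in this format can be produced with `accuracy_score` and `classification_report` from scikit-learn, both already imported at the top of the notebook. The sketch below uses ten made-up labels purely to show the calls; it is not the code that generated the saved results.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels for illustration only
y_true = ["e", "e", "e", "o", "o", "o", "s", "s", "s", "s"]
y_pred = ["e", "e", "s", "o", "o", "e", "s", "s", "o", "s"]

acc = accuracy_score(y_true, y_pred)
# output_dict=True returns the same numbers as a nested dictionary
report = classification_report(y_true, y_pred, output_dict=True)

print(f"Accuracy: {acc:.2f}")
print(classification_report(y_true, y_pred))
```
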





## Confusion matrix

From the confusion matrix, we can see that the largest group of **false positives** (123 galaxies) consists of spiral galaxies mistakenly classified as odd, while the largest group of **false negatives** (101 galaxies) consists of elliptical galaxies mistakenly classified as spiral. Possible explanations for this behaviour are the following:

* Spiral galaxies often exhibit intricate structures, arms, and irregularities that may resemble the features associated with odd galaxies. The classifier may struggle to differentiate between certain types of spiral and odd galaxies. Additionally, the presence of unusual or asymmetric features in some spiral galaxies could contribute to misclassifications as odd.

* Elliptical galaxies are characterized by their smooth and featureless appearance, lacking the prominent arms seen in spiral galaxies. The misclassification of elliptical galaxies as spiral may occur when there are subtle details or variations in brightness that resemble spiral structures.

<div align="center">
  <img title="Conf_matrix" src="Images/Confus_matrix.png">
</div>
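
To make the reading of the matrix concrete, the sketch below builds a confusion matrix with scikit-learn and extracts the most frequent confusion together with per-class false positives and false negatives. The labels are made up for the example, so the counts are unrelated to the 123 and 101 quoted above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["e", "o", "s"]
# Made-up labels for illustration only
y_true = ["e", "e", "e", "o", "o", "o", "s", "s", "s", "s"]
y_pred = ["e", "e", "s", "o", "o", "e", "s", "o", "o", "s"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=classes)

# Per-class errors: column sum minus diagonal = false positives,
# row sum minus diagonal = false negatives
false_positives = cm.sum(axis=0) - np.diag(cm)
false_negatives = cm.sum(axis=1) - np.diag(cm)

# Largest off-diagonal entry = most frequent confusion
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
worst_pair = (classes[i], classes[j], int(off_diag[i, j]))

print(cm)
print("most common error: true %s predicted as %s (%d cases)" % worst_pair)
```
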
