# Stellar Blends Classification

### In this notebook, we run the un-normalized and normalized datasets through the MuyGPyS classifier (a Python classification routine that uses the MuyGPs Gaussian process hyperparameter estimation method) and compare the resulting accuracies.

**Note:** You must have run `data_normalization.ipynb` before continuing.

In [15]:
# from MuyGPyS import config
# config.update("muygpys_jax_enabled", False)

import numpy as np
import pandas as pd
import random
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from MuyGPyS.examples.classify import do_classify
from MuyGPyS.gp.deformation import F2, Isotropy
from MuyGPyS.gp.hyperparameter import Parameter, Parameter as ScalarParam
from MuyGPyS.gp.kernels import RBF, Matern
from MuyGPyS.gp.noise import HomoscedasticNoise
from MuyGPyS.optimize import Bayes_optimize
from MuyGPyS.optimize.loss import LossFn, cross_entropy_fn



### Read in all flattened data (normalized and un-normalized):

In [16]:
from glob import glob

# read normalized data csv file names from the data directory
norm_data_names = glob('../data/data-norm/max-pixel-all/*.csv')
# get rid of "../data/data-norm/max-pixel-all/"
norm_data_names = [name.split('/')[-1] for name in norm_data_names]
# norm_data_names[:10]

In [17]:
# sort the names by the second '_'-separated token (string comparison, not numeric)
norm_data_names.sort(key=lambda x: x.split('_')[1])
norm_data_names[:10]

['nthroot_0.0_data.csv',
 'nthroot_0.03448_data.csv',
 'nthroot_0.03448.csv',
 'nthroot_0.06897_data.csv',
 'nthroot_0.06897.csv',
 'nthroot_0.1034_data.csv',
 'nthroot_0.1034.csv',
 'nthroot_0.1379_data.csv',
 'nthroot_0.1379.csv',
 'nthroot_0.1724_data.csv']
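
The sort above compares the second `_`-separated token as a string, which is why names with and without the `_data` suffix interleave and why a token such as `21` would sort before `3`. If a true numeric ordering is wanted, one possible approach is sketched below. The helper `numeric_sort_key` and the short example list are illustrative assumptions, not part of the original notebook.

```python
import re

def numeric_sort_key(name):
    """Sort key: (leading number of the second '_'-separated token, full name)."""
    token = name.split('_')[1]
    match = re.match(r'\d+(?:\.\d+)?', token)
    # Tokens without a leading number (e.g. 'log.csv') are placed last.
    value = float(match.group()) if match else float('inf')
    return (value, name)

# Illustrative file names in the same style as the data directory
names = ['norm_21.csv', 'nthroot_0.1034_data.csv', 'norm_2.csv',
         'nthroot_log.csv', 'nthroot_0.03448.csv']
sorted_names = sorted(names, key=numeric_sort_key)
print(sorted_names)
```

Including the full name as the second element of the key keeps the ordering deterministic when two files share the same number.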

### Define a function that generates "one-hot" values.

This essentially just takes our truth labels of 0 and 1, and does the following conversions for use in the classifier:
- 0 to [1., -1.]
- 1 to [-1., 1.]

In [18]:
def generate_onehot_value(values):
    onehot = []
    for val in values:
        if val == 0:
            onehot.append([1., -1.])
        elif val == 1:
            onehot.append([-1., 1.])
    return onehot
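
Note that the loop above silently skips any label that is neither 0 nor 1, which would leave the output shorter than the input. As a hedged alternative, the same conversion can be written with NumPy indexing. The name `onehot_pm1` and the `num_classes` argument are assumptions made for this sketch; it raises an error on unexpected labels instead of dropping them.

```python
import numpy as np

def onehot_pm1(labels, num_classes=2):
    """Map integer labels to rows of -1 with +1 at the label's index."""
    labels = np.asarray(labels, dtype=int)
    if labels.min() < 0 or labels.max() >= num_classes:
        raise ValueError("labels must lie in [0, num_classes)")
    # Start from all -1, then set the column matching each label to +1
    onehot = -np.ones((labels.size, num_classes))
    onehot[np.arange(labels.size), labels] = 1.0
    return onehot

encoded = onehot_pm1([0, 1, 1, 0])
print(encoded)
```

Taking `np.argmax(encoded, axis=1)` recovers the original 0/1 labels, which is the same operation the accuracy calculation later applies to the classifier output.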

### Run the classifier on each dataset

For each dataset (un-normalized and normalized) in `norm_data_names`, this for loop does the following:
- Separates the labels from the data
- Splits the data into training and testing sets
    - `test_size` is the fraction of the data used for testing; the cell below uses 0.2, so 20% of the data is used for testing and 80% for training.
    - `random_state` fixes the shuffle seed, so the split is reproducible and datasets with the same row order are trained and tested on the same numbers of single and blended stars.
- Gets the one-hot values for the testing and training labels
- Puts `train` and `test` into the format used for the classifier, a dictionary with the keys:
    - 'input': the flattened image data (features)
    - 'output': the one-hot encoded labels
    - 'lookup': the original 0/1 truth labels
- Does the classification (`do_classify`)
- Computes the accuracy of the classifier for the given dataset by comparing predicted labels to truth labels.
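
The last step can be illustrated in isolation. The sketch below assumes the classifier returns one score per class for each test image and uses small made-up arrays in place of real `do_classify` output; `accuracy_percent` is a hypothetical helper name.

```python
import numpy as np

def accuracy_percent(surrogate_predictions, onehot_truth):
    """Percentage of rows whose highest-scoring class matches the truth."""
    predicted = np.argmax(surrogate_predictions, axis=1)
    truth = np.argmax(onehot_truth, axis=1)
    return 100.0 * np.mean(predicted == truth)

# Made-up scores for four test images (not real classifier output)
preds = np.array([[0.8, -0.7], [-0.2, 0.1], [0.3, -0.4], [-0.9, 0.6]])
# One-hot truth for labels 0, 1, 1, 1
truth = np.array([[1., -1.], [-1., 1.], [-1., 1.], [-1., 1.]])
acc = accuracy_percent(preds, truth)
print(acc)  # 75.0: the third image is misclassified
```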

In [19]:
nn_kwargs_exact = {"nn_method": "exact", "algorithm": "ball_tree"}

nn_kwargs_hnsw = {"nn_method": "hnsw"}

k_kwargs_rbf = {
    "kernel": RBF(
        deformation=Isotropy(
            metric=F2,
            length_scale=Parameter(1.0, (1e-2, 1e2)),
        ),
    ),
    "noise": HomoscedasticNoise(1e-5),
}
k_kwargs_mattern = {
    "kernel": Matern(
        smoothness=ScalarParam(0.5),
        deformation=Isotropy(
            metric=F2,
            length_scale=Parameter(1.0, (1e-2, 1e2)),
        ),
    ),
    "noise": HomoscedasticNoise(1e-5),
}

In [20]:
norm_name = []
my_accuracy = []
for path in tqdm(norm_data_names):
    path1 = '../data/data-norm/max-pixel-all/' + path
    data = pd.read_csv(path1,na_values='-')
    data.fillna(0,inplace=True)
    data_label = ''.join(path.split('.')[:2])
    truth_labels = data.iloc[:, 0].values
    image_data = data.iloc[:, 1:].values

    X_train, X_test, y_train, y_test = train_test_split(image_data, truth_labels, test_size=0.2, random_state=42)

    print("=============== ", data_label, " ===============")
    print('Training data:', len(y_train[y_train==0]), 'single stars and', len(y_train[y_train==1]), 'blended stars')
    print('Testing data:', len(y_test[y_test==0]), 'single stars and', len(y_test[y_test==1]), 'blended stars')

    onehot_train, onehot_test = generate_onehot_value(y_train), generate_onehot_value(y_test)

    train = {'input': X_train, 'output': onehot_train, 'lookup': y_train}
    test = {'input': X_test, 'output': onehot_test, 'lookup': y_test}

    print("Running Classifier on", data_label)
    # Switch verbose to True for more output
    muygps, nbrs_lookup, surrogate_predictions = do_classify(
                                test_features=np.array(test['input']), 
                                train_features=np.array(train['input']), 
                                train_labels=np.array(train['output']), 
                                nn_count=15,
                                batch_count=200,
                                loss_fn=cross_entropy_fn,
                                opt_fn=Bayes_optimize,
                                k_kwargs=k_kwargs_mattern,
                                nn_kwargs=nn_kwargs_hnsw,
                                verbose=False)
    predicted_labels = np.argmax(surrogate_predictions, axis=1)
    accur = np.around((np.sum(predicted_labels == np.argmax(test["output"], axis=1))/len(predicted_labels))*100, 3)
    norm_name.append(''.join(data_label.split('_')[-3:]))
    my_accuracy.append(accur)
    print("Total accuracy for", data_label, ":", accur, '%')

  0%|          | 0/79 [00:00<?, ?it/s]

Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_00_data


  1%|▏         | 1/79 [00:02<02:49,  2.17s/it]

Total accuracy for nthroot_00_data : 55.715 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_003448_data


  3%|▎         | 2/79 [00:04<02:43,  2.12s/it]

Total accuracy for nthroot_003448_data : 80.517 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_003448


  4%|▍         | 3/79 [00:06<02:38,  2.09s/it]

Total accuracy for nthroot_003448 : 78.646 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_006897_data


  5%|▌         | 4/79 [00:08<02:37,  2.10s/it]

Total accuracy for nthroot_006897_data : 80.334 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_006897


  6%|▋         | 5/79 [00:10<02:29,  2.01s/it]

Total accuracy for nthroot_006897 : 78.848 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_01034_data


  8%|▊         | 6/79 [00:12<02:23,  1.97s/it]

Total accuracy for nthroot_01034_data : 80.646 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_01034


  9%|▉         | 7/79 [00:14<02:19,  1.94s/it]

Total accuracy for nthroot_01034 : 78.206 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_01379_data


 10%|█         | 8/79 [00:15<02:11,  1.85s/it]

Total accuracy for nthroot_01379_data : 80.646 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_01379


 11%|█▏        | 9/79 [00:17<02:12,  1.90s/it]

Total accuracy for nthroot_01379 : 78.499 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_01724_data


 13%|█▎        | 10/79 [00:20<02:28,  2.15s/it]

Total accuracy for nthroot_01724_data : 79.435 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_01724


 14%|█▍        | 11/79 [00:22<02:26,  2.15s/it]

Total accuracy for nthroot_01724 : 78.187 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_02069_data


 15%|█▌        | 12/79 [00:24<02:15,  2.02s/it]

Total accuracy for nthroot_02069_data : 80.664 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_02069


 16%|█▋        | 13/79 [00:26<02:15,  2.05s/it]

Total accuracy for nthroot_02069 : 78.609 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_02414_data


 18%|█▊        | 14/79 [00:28<02:13,  2.06s/it]

Total accuracy for nthroot_02414_data : 80.279 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_02414


 19%|█▉        | 15/79 [00:30<02:07,  2.00s/it]

Total accuracy for nthroot_02414 : 79.123 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_02759_data


 20%|██        | 16/79 [00:32<02:10,  2.07s/it]

Total accuracy for nthroot_02759_data : 79.618 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_02759


 22%|██▏       | 17/79 [00:34<02:03,  1.99s/it]

Total accuracy for nthroot_02759 : 78.481 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_03103_data


 23%|██▎       | 18/79 [00:36<02:07,  2.10s/it]

Total accuracy for nthroot_03103_data : 74.684 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_03103


 24%|██▍       | 19/79 [00:38<02:00,  2.00s/it]

Total accuracy for nthroot_03103 : 79.362 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_03448_data


 25%|██▌       | 20/79 [00:40<01:57,  2.00s/it]

Total accuracy for nthroot_03448_data : 80.04 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_03448


 27%|██▋       | 21/79 [00:42<02:02,  2.12s/it]

Total accuracy for nthroot_03448 : 77.399 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_03793_data


 28%|██▊       | 22/79 [00:44<01:57,  2.07s/it]

Total accuracy for nthroot_03793_data : 80.059 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_03793


 29%|██▉       | 23/79 [00:47<02:03,  2.20s/it]

Total accuracy for nthroot_03793 : 73.454 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_04138_data


 30%|███       | 24/79 [00:49<01:56,  2.12s/it]

Total accuracy for nthroot_04138_data : 80.407 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_04138


 32%|███▏      | 25/79 [00:51<01:55,  2.14s/it]

Total accuracy for nthroot_04138 : 79.031 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_04483_data


 33%|███▎      | 26/79 [00:53<01:51,  2.10s/it]

Total accuracy for nthroot_04483_data : 79.857 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_04483


 34%|███▍      | 27/79 [00:55<01:46,  2.05s/it]

Total accuracy for nthroot_04483 : 79.068 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_04828_data


 35%|███▌      | 28/79 [00:57<01:47,  2.10s/it]

Total accuracy for nthroot_04828_data : 76.041 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_04828


 37%|███▋      | 29/79 [01:00<01:50,  2.20s/it]

Total accuracy for nthroot_04828 : 73.326 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_05172_data


 38%|███▊      | 30/79 [01:01<01:43,  2.11s/it]

Total accuracy for nthroot_05172_data : 77.16 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_05172


 39%|███▉      | 31/79 [01:03<01:38,  2.06s/it]

Total accuracy for nthroot_05172 : 79.196 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_05517_data


 41%|████      | 32/79 [01:05<01:36,  2.05s/it]

Total accuracy for nthroot_05517_data : 71.583 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_05517


 42%|████▏     | 33/79 [01:07<01:31,  1.98s/it]

Total accuracy for nthroot_05517 : 78.683 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_05862_data


 43%|████▎     | 34/79 [01:10<01:32,  2.05s/it]

Total accuracy for nthroot_05862_data : 74.17 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_05862


 44%|████▍     | 35/79 [01:11<01:25,  1.95s/it]

Total accuracy for nthroot_05862 : 78.775 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_06207_data


 46%|████▌     | 36/79 [01:13<01:23,  1.93s/it]

Total accuracy for nthroot_06207_data : 74.39 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_06207


 47%|████▋     | 37/79 [01:15<01:19,  1.90s/it]

Total accuracy for nthroot_06207 : 70.006 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_06552_data


 48%|████▊     | 38/79 [01:17<01:18,  1.91s/it]

Total accuracy for nthroot_06552_data : 79.857 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_06552


 49%|████▉     | 39/79 [01:19<01:18,  1.97s/it]

Total accuracy for nthroot_06552 : 75.142 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_06897_data


 51%|█████     | 40/79 [01:21<01:16,  1.97s/it]

Total accuracy for nthroot_06897_data : 78.554 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_06897


 52%|█████▏    | 41/79 [01:23<01:14,  1.95s/it]

Total accuracy for nthroot_06897 : 78.554 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_07241_data


 53%|█████▎    | 42/79 [01:25<01:13,  1.99s/it]

Total accuracy for nthroot_07241_data : 77.527 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_07241


 54%|█████▍    | 43/79 [01:27<01:11,  1.98s/it]

Total accuracy for nthroot_07241 : 70.703 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_07586_data


 56%|█████▌    | 44/79 [01:29<01:10,  2.03s/it]

Total accuracy for nthroot_07586_data : 75.197 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_07586


 57%|█████▋    | 45/79 [01:31<01:08,  2.02s/it]

Total accuracy for nthroot_07586 : 78.536 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_07931_data


 58%|█████▊    | 46/79 [01:33<01:04,  1.97s/it]

Total accuracy for nthroot_07931_data : 79.673 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_07931


 59%|█████▉    | 47/79 [01:35<01:01,  1.93s/it]

Total accuracy for nthroot_07931 : 73.216 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_08276_data


 61%|██████    | 48/79 [01:37<01:00,  1.96s/it]

Total accuracy for nthroot_08276_data : 75.436 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_08276


 62%|██████▏   | 49/79 [01:39<00:59,  1.97s/it]

Total accuracy for nthroot_08276 : 76.812 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_08621_data


 63%|██████▎   | 50/79 [01:40<00:55,  1.90s/it]

Total accuracy for nthroot_08621_data : 79.508 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_08621


 65%|██████▍   | 51/79 [01:42<00:53,  1.91s/it]

Total accuracy for nthroot_08621 : 69.584 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_08966_data


 66%|██████▌   | 52/79 [01:44<00:51,  1.91s/it]

Total accuracy for nthroot_08966_data : 78.628 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_08966


 67%|██████▋   | 53/79 [01:46<00:49,  1.92s/it]

Total accuracy for nthroot_08966 : 77.839 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_0931_data
Data point [100.] is not unique. 1 duplicates registered. Continuing ...
Data point [100.] is not unique. 2 duplicates registered. Continuing ...
Data point [100.] is not unique. 3 duplicates registered. Continuing ...
Data point [100.] is not unique. 4 duplicates registered. Continuing ...
Data point [100.] is not unique. 5 duplicates registered. Continuing ...
Data point [100.] is not unique. 6 duplicates registered. Continuing ...
Data point [100.] is not unique. 7 duplicates registered. Continuing ...
Data point [100.] is not unique. 8 duplicates registered. Continuing ...
Data point [100.] is not unique. 9 duplicates registered. Continuing ...


 68%|██████▊   | 54/79 [01:48<00:49,  1.97s/it]

Total accuracy for nthroot_0931_data : 69.382 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_0931


 70%|██████▉   | 55/79 [01:50<00:47,  1.98s/it]

Total accuracy for nthroot_0931 : 69.602 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_09655_data


 71%|███████   | 56/79 [01:52<00:46,  2.00s/it]

Total accuracy for nthroot_09655_data : 70.317 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_09655


 72%|███████▏  | 57/79 [01:54<00:43,  1.96s/it]

Total accuracy for nthroot_09655 : 77.766 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_1_datacsv


 73%|███████▎  | 58/79 [01:56<00:39,  1.88s/it]

Total accuracy for norm_1_datacsv : 80.646 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_10_data


 75%|███████▍  | 59/79 [01:58<00:38,  1.90s/it]

Total accuracy for nthroot_10_data : 75.142 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_10


 76%|███████▌  | 60/79 [02:00<00:36,  1.90s/it]

Total accuracy for nthroot_10 : 72.097 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_1csv


 77%|███████▋  | 61/79 [02:02<00:35,  1.95s/it]

Total accuracy for norm_1csv : 80.866 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_2_datacsv


 78%|███████▊  | 62/79 [02:04<00:33,  1.98s/it]

Total accuracy for norm_2_datacsv : 79.472 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_2csv


 80%|███████▉  | 63/79 [02:06<00:31,  1.94s/it]

Total accuracy for norm_2csv : 78.628 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_21_datacsv


 81%|████████  | 64/79 [02:08<00:28,  1.90s/it]

Total accuracy for norm_21_datacsv : 80.426 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_21csv


 82%|████████▏ | 65/79 [02:10<00:27,  1.93s/it]

Total accuracy for norm_21csv : 80.444 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_3_datacsv


 84%|████████▎ | 66/79 [02:12<00:25,  1.93s/it]

Total accuracy for norm_3_datacsv : 74.904 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_3csv


 85%|████████▍ | 67/79 [02:13<00:23,  1.94s/it]

Total accuracy for norm_3csv : 77.527 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_31_datacsv


 86%|████████▌ | 68/79 [02:15<00:20,  1.87s/it]

Total accuracy for norm_31_datacsv : 68.85 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_31csv


 87%|████████▋ | 69/79 [02:17<00:18,  1.84s/it]

Total accuracy for norm_31csv : 69.712 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_4_datacsv


 89%|████████▊ | 70/79 [02:19<00:16,  1.84s/it]

Total accuracy for norm_4_datacsv : 78.334 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_4csv


 90%|████████▉ | 71/79 [02:20<00:14,  1.79s/it]

Total accuracy for norm_4csv : 77.472 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_41_datacsv


 91%|█████████ | 72/79 [02:22<00:12,  1.74s/it]

Total accuracy for norm_41_datacsv : 68.96 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_41csv


 92%|█████████▏| 73/79 [02:24<00:10,  1.72s/it]

Total accuracy for norm_41csv : 68.997 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_5_datacsv


 94%|█████████▎| 74/79 [02:26<00:08,  1.80s/it]

Total accuracy for norm_5_datacsv : 79.6 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_5csv


 95%|█████████▍| 75/79 [02:28<00:07,  1.81s/it]

Total accuracy for norm_5csv : 80.499 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on norm_51_datacsv


 96%|█████████▌| 76/79 [02:30<00:05,  1.86s/it]

Total accuracy for norm_51_datacsv : 76.518 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on norm_51csv
Data point [100.] is not unique. 1 duplicates registered. Continuing ...
Data point [100.] is not unique. 2 duplicates registered. Continuing ...
Data point [100.] is not unique. 3 duplicates registered. Continuing ...


 97%|█████████▋| 77/79 [02:32<00:04,  2.04s/it]

Total accuracy for norm_51csv : 68.043 %
Training data: 12072 single stars and 9729 blended stars
Testing data: 3037 single stars and 2414 blended stars
Running Classifier on nthroot_log_datacsv


 99%|█████████▊| 78/79 [02:34<00:02,  2.05s/it]

Total accuracy for nthroot_log_datacsv : 77.747 %
Training data: 12122 single stars and 9679 blended stars
Testing data: 2987 single stars and 2464 blended stars
Running Classifier on nthroot_logcsv


100%|██████████| 79/79 [02:36<00:00,  1.98s/it]

Total accuracy for nthroot_logcsv : 77.27 %





In [21]:
accura = pd.DataFrame({'norm_name': norm_name, 'accuracy': my_accuracy})
accura.to_csv('../data/muygps-max-all-accuracy.csv', index=False)

accura = pd.read_csv('../data/muygps-max-all-accuracy.csv')   
accura.sort_values(by=['accuracy'], inplace=True)
accura.T

| | 0 | 76 | 67 | 71 | 72 | 53 | 50 | 54 | 68 | 36 | ... | 23 | 63 | 64 | 74 | 1 | 5 | 57 | 7 | 11 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| norm_name | nthroot00data | norm51csv | norm31datacsv | norm41datacsv | norm41csv | nthroot0931data | nthroot08621 | nthroot0931 | norm31csv | nthroot06207 | ... | nthroot04138data | norm21datacsv | norm21csv | norm5csv | nthroot003448data | nthroot01034data | norm1datacsv | nthroot01379data | nthroot02069data | norm1csv |
| accuracy | 55.715 | 68.043 | 68.85 | 68.96 | 68.997 | 69.382 | 69.584 | 69.602 | 69.712 | 70.006 | ... | 80.407 | 80.426 | 80.444 | 80.499 | 80.517 | 80.646 | 80.646 | 80.646 | 80.664 | 80.866 |


In [22]:
accura.nlargest(10, 'accuracy')

| | norm_name | accuracy |
|---|---|---|
| 60 | norm1csv | 80.866 |
| 11 | nthroot02069data | 80.664 |
| 5 | nthroot01034data | 80.646 |
| 57 | norm1datacsv | 80.646 |
| 7 | nthroot01379data | 80.646 |
| 1 | nthroot003448data | 80.517 |
| 74 | norm5csv | 80.499 |
| 64 | norm21csv | 80.444 |
| 63 | norm21datacsv | 80.426 |
| 23 | nthroot04138data | 80.407 |


<u>***Note:*** Each run of the classifier will produce different accuracies.</u>

### As you can see, all 5 normalization techniques do much better than the un-normalized data, with some performing better than others.

### Things you can try, to see how they affect the classifier accuracy:
- Play around with different values of `test_size`. What does testing on more or less data do?
- Play around with different parameters that are passed to `do_classify`. Start with `nn_count` and `embed_dim`. (For what those arguments are, and a full list of the arguments you can pass to `do_classify`, look at the function `do_classify` in `/MuyGPyS/examples/classify.py`.)
- Try generating more cutouts using `generating_ZTF_cutouts_from_ra_dec.ipynb`. How does having more testing and training data affect the classifier?
- Play around with the parameters used to make the cutouts. What happens if you remove blend cuts? Can the classifier classify blends? What if you increase the seeing limit? Can the classifier classify images with bad atmospheric quality?

<hr style="border:2px solid gray"> </hr>

## <u>**Optional Step:**</u>
### Running each dataset through the classifier multiple times, testing and training on varying amounts of data, different random states, and plotting the accuracy outcomes

- Each time you run the following steps, you change:
    - `test_size`: This is used in `train_test_split`, and changes the size of the testing and training datasets, which affects the accuracy of the classifier.
    - `random_state`: This is used in `train_test_split`, and changes which images land in each split, and therefore the ratio of single to blended stars that get tested on.
- You can set how many times to run the classifier with varying test sizes and random states by setting `num_runs`, and you can manually change the test_size values by editing `test_size_values`.
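
The averaging scheme described above can be sketched independently of the classifier. In the snippet below, `average_accuracy` and `fake_score` are hypothetical names: `fake_score` is a stand-in that returns a made-up accuracy, where a real run would call a function that splits the data with the given `test_size` and random state and returns the classifier accuracy.

```python
import random
import statistics

def average_accuracy(score_fn, test_sizes, num_runs, seed=0):
    """Average score_fn over num_runs random states for each test size."""
    rng = random.Random(seed)  # seeded so the drawn random states repeat
    averages = {}
    for test_size in test_sizes:
        runs = [score_fn(test_size, rng.randint(0, 10000))
                for _ in range(num_runs)]
        averages[test_size] = statistics.mean(runs)
    return averages

def fake_score(test_size, state):
    # Hypothetical stand-in for a classifier run; ignores `state`
    return 90.0 - 20.0 * test_size

averages = average_accuracy(fake_score, [0.2, 0.5], num_runs=3)
print(averages)
```

Passing the scoring function as an argument keeps the loop reusable: the same averaging code works whether the score comes from MuyGPyS or from another classifier.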

In [23]:
test_size_values = [.2, .25, .33, .4, .5, .75]
num_runs = 3

In [24]:
# def run_classifier(image_data, truth_labels, test_size, state):
#     X_train, X_test, y_train, y_test = train_test_split(image_data, truth_labels, test_size=test_size, random_state=state)
#     onehot_train, onehot_test = generate_onehot_value(y_train), generate_onehot_value(y_test)
#     train = {'input': X_train, 'output': onehot_train, 'lookup': y_train}
#     test = {'input': X_test, 'output': onehot_test, 'lookup': y_test}
#     #Switch verbose to True for more output
#     muygps, nbrs_lookup, surrogate_predictions= do_classify(
#                         test_features=np.array(test['input']),
#                         train_features=np.array(train['input']), 
#                         train_labels=np.array(train['output']), 
#                         nn_count=20,
#                         batch_count=200,
#                         loss_fn=cross_entropy_fn,
#                         opt_fn=Bayes_optimize,
#                         k_kwargs=k_kwargs_mattern,
#                         nn_kwargs=nn_kwargs_hnsw, 
#                         verbose=False) 
#     predicted_labels = np.argmax(surrogate_predictions, axis=1)
#     accuracy = (np.sum(predicted_labels == np.argmax(test["output"], axis=1))/len(predicted_labels))*100
#     return accuracy

In [25]:
# from time import perf_counter
# start = perf_counter()

# accuracies = pd.DataFrame({'test_size': test_size_values})

# # Setting progress bar for each time the classifier will be run during this step
# pbar = tqdm(total=len(norm_data_names)*num_runs*len(test_size_values), desc='Running classifier', leave=True)

# for path in norm_data_names:
#     path1 = '../data/data-norm/max-pixel-all/' + path
#     data = pd.read_csv(path1,na_values='-')
#     data.fillna(0,inplace=True)
#     data_label = ''.join(path.split('.')[:2])
#     truth_labels = data.iloc[:, 0].values
#     image_data = data.iloc[:, 1:].values
#     all_acc_dataset = []
#     for test_size in test_size_values:
#         acc = []
#         idx = 1
#         while idx <= num_runs:
#             accuracy = run_classifier(image_data, truth_labels, test_size, state=random.randint(0, 10000))
#             acc.append(accuracy)
#             pbar.update(1)
#             idx += 1
#         avg_acc = np.average(acc)
#         all_acc_dataset.append(avg_acc)
#     temp_df = pd.DataFrame({str(data_label): all_acc_dataset})
#     accuracies = pd.concat([accuracies, temp_df], axis=1)
# end = perf_counter()
# print(f"Time taken to run the classifier on all datasets: {(end-start)/60} minutes")
# accuracies.to_csv('max-all-accuracies.csv', index=False)
# display(accuracies)

In [26]:
# plt.figure(figsize=(12,4))

# for path in norm_data_names:
#     path1 = '../data/data-norm/max-pixel-all/' + path
#     data = pd.read_csv(path1,na_values='-')
#     data.fillna(0,inplace=True)
#     data_label = ''.join(path.split('.')[:2])
#     # data_label = 'Normalized {} {}'.format(*path.split('_')[:2])
#     plt.plot(accuracies['test_size'].values, accuracies[data_label].values, label=data_label)

# plt.title("MuyGPs Stellar Blending 2-class")    
# plt.legend(fontsize=10)   
# plt.tick_params(labelsize=10)
# plt.xlabel("Test size (as a ratio to full data size)", fontsize=10)
# plt.ylabel("Accuracy [%]", fontsize=10)
# plt.savefig("muygps_max_all_abs.png")
# plt.show()

In [27]:
# accuracies = pd.read_csv('max-all-accuracies.csv')
# np.max(accuracies.values, axis=1)

In [28]:
# idcs = np.argmax(accuracies.values, axis=1)
# accuracies.iloc[:, idcs]

There is no benefit to normalizing images by dividing by the maximum over the entire dataset. All of the maximum accuracies above occur where the max-all normalization is not applied. However, the neural network model seems to favor max-all normalization over image-by-image max normalization.