<br>
<h1 style = "font-size:60px; font-family:Monaco ; font-weight : normal; background-color: #000055 ; color : #9999ff; text-align: center; border-radius: 50px 50px;">SETI - Breakthrough Listen<br>EfficientNet B3</h1>
<br>

## Description of problem
The vast distances between stars and galaxies, combined with the inverse square law, make detecting electromagnetic signals from intelligent civilizations a challenging task.  Advances in machine learning, particularly in computer vision and time series analysis, may bring us closer to that goal.  This notebook, and the larger experiment it belongs to, is an attempt to push the state of the art in that direction.

## SETI Breakthrough Listen Dataset
The purpose of this notebook is to use a pre-trained EfficientNet model to classify cadence samples as negative or positive (anomalous signal).  The data consists of "cadence snippets taken from the Green Bank Telescope".  The Breakthrough Listen instrument at the telescope is a digital spectrometer that generates spectrograms from the raw telescope data using the Fourier transform.  The data represent signal intensity as a function of frequency and time.  

The "Cadence" is described in the "Data Information" section of the competition:  
5 minutes on star “A”, then 5 minutes on star “B”, then back to star “A” for 5 minutes, then “C”, then back to “A”, then finishing with 5 minutes on star “D”. One set of six observations (ABACAD) is referred to as a “cadence”.  

The shape of the data is (6, 273, 256), where 273 represents the time (5 minutes) dimension, and 256 represents the frequency dimension.  
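
As a rough illustration of this layout, the sketch below builds a synthetic array with the cadence shape and separates the on-target "A" observations from the others. The random data and the variable names are assumptions made for illustration; real samples are loaded from the competition's `.npy` files.

```python
import numpy as np

# Synthetic stand-in for one cadence:
# 6 observations x 273 time bins x 256 frequency bins.
rng = np.random.default_rng(0)
cadence = rng.normal(size=(6, 273, 256))

# ABACAD ordering: even indices (0, 2, 4) are star "A",
# odd indices (1, 3, 5) are stars "B", "C" and "D".
on_target = cadence[::2]
off_target = cadence[1::2]

print(on_target.shape)   # (3, 273, 256)
print(off_target.shape)  # (3, 273, 256)
```

Slicing with `[::2]` is the same selection the data generator later in this notebook applies to each loaded sample.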


## This notebook was originally created by:
https://www.kaggle.com/anirudhg15/seti-et-baseline-efficientnetb3

In [13]:
# Imports
import os
import math
import numpy as np
import pandas as pd

# DL Modules
import tensorflow as tf
from tensorflow import keras
import efficientnet.tfkeras as efn

# ML / Data Prep
from sklearn import model_selection
from sklearn.metrics import accuracy_score

In [2]:
# Test if a GPU is available
len(tf.config.list_physical_devices('GPU')) > 0

True

In [3]:
datadir: str = "/data/data/datasets/seti/"
train_labels: pd.DataFrame = pd.read_csv(os.path.join(datadir, "train_labels.csv"))

In [4]:
def id_to_path(sample_id: str) -> str:
    """ Return full path to numpy file from sample id """
    return os.path.join(datadir, "train", sample_id[0], f"{sample_id}.npy")
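
The helper places each file in a subdirectory named after the first character of its id. A short usage sketch, repeating the helper so that it runs on its own (the sample id is one of the training ids shown later in this notebook):

```python
import os

# Same data directory and helper as in the cells above.
datadir = "/data/data/datasets/seti/"

def id_to_path(sample_id: str) -> str:
    """Return full path to numpy file from sample id."""
    return os.path.join(datadir, "train", sample_id[0], f"{sample_id}.npy")

example_path = id_to_path("0add6ccf9b2e9ce")
print(example_path)
```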

In [5]:
X = train_labels['id'].values
y = train_labels['target'].values

X_trainval, X_test, y_trainval, y_test = model_selection.train_test_split(
    X, 
    y, 
    test_size=.2, 
    random_state=42, 
    stratify=y
)

X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X_trainval, 
    y_trainval, 
    test_size=.2, 
    random_state=42, 
    stratify=y_trainval
)

print(f"LENGTHS ; X_train : {len(X_train)}, X_val {X_val.shape}, X_test {X_test.shape}")
print(f"Number of positive samples ; y_train : {np.sum(y_train==1)}, y_val: {np.sum(y_val==1)}, y_test: {np.sum(y_test==1)}")
print(f"Ratio of positive samples ; y_train : {np.sum(y_train==1)/len(y_train):.3f}, y_val: {np.sum(y_val==1)/len(y_val):.3f}, y_test: {np.sum(y_test==1)/len(y_test):.3f}")

LENGTHS ; X_train : 38400, X_val (9600,), X_test (12000,)
Number of positive samples ; y_train : 3840, y_val: 960, y_test: 1200
Ratio of positive samples ; y_train : 0.100, y_val: 0.100, y_test: 0.100
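
The lengths printed above follow from applying a 20% hold-out twice. A minimal sketch of the arithmetic, assuming the total of 60,000 labelled samples implied by the printed lengths (38,400 + 9,600 + 12,000):

```python
# Total implied by the printed split lengths.
n_total = 38400 + 9600 + 12000
test_percent = 20

# First split: hold out 20% of everything as the test set.
n_test = n_total * test_percent // 100
n_trainval = n_total - n_test

# Second split: hold out 20% of the remainder as the validation set.
n_val = n_trainval * test_percent // 100
n_train = n_trainval - n_val

print(n_train, n_val, n_test)  # 38400 9600 12000

# Share of the full dataset in each split, in percent.
shares = [n * 100 // n_total for n in (n_train, n_val, n_test)]
print(shares)  # [64, 16, 20]
```

Because both splits are stratified on the label, each subset keeps the same 10% share of positive samples.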


In [6]:
X_train[:3]

array(['0add6ccf9b2e9ce', '1e4cb3f498e29ca', 'b6606b085a618f1'],
      dtype=object)

### Dataset class

In [7]:
class DataGenerator(keras.utils.Sequence):
    """Yield batches of cadence arrays (and labels, if given) for Keras."""

    def __init__(self, x_set, y_set=None, batch_size=32):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        # Labels are only passed for the training and validation sets
        self.is_train = y_set is not None

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        batch_slice = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch_ids = self.x[batch_slice]

        # Keep every other observation (indices 0, 2, 4): shape (3, 273, 256)
        list_x = [np.load(id_to_path(x))[::2] for x in batch_ids]
        # Move the observation axis last: (batch, 273, 256, 3)
        batch_x = np.moveaxis(np.asarray(list_x), 1, -1)
        # Cast to float and divide by 255 (image-style scaling)
        batch_x = batch_x.astype("float") / 255

        if self.is_train:
            return batch_x, self.y[batch_slice]
        return batch_x
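
To make the tensor shapes concrete, the following sketch mimics what `__getitem__` does to one batch, using random arrays in place of `np.load`. The synthetic data and the batch size of 4 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 4  # illustrative value, not the one used for training

# Each sample: keep observations 0, 2, 4 of a (6, 273, 256) cadence,
# which leaves an array of shape (3, 273, 256).
list_x = [rng.normal(size=(6, 273, 256))[::2] for _ in range(batch_size)]

# Stacking gives (batch, 3, 273, 256); moving axis 1 to the end gives
# a channels-last layout with the three observations as channels.
batch_x = np.moveaxis(np.asarray(list_x), 1, -1)

print(batch_x.shape)  # (4, 273, 256, 3)
```

Each of the three kept observations ends up as one "colour" channel, which is what lets an ImageNet-pretrained network with a 3-channel input accept the data.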

### Model Hyperparams

In [8]:
input_size = (273, 256, 3)
batch_size = 16
n_epoch = 2
seed = 42

### Model Definition

In [9]:
model = tf.keras.Sequential([
        efn.EfficientNetB3(input_shape=input_size,weights='imagenet',include_top=False),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(1, activation='sigmoid')
        ])

model.summary()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy', metrics=[keras.metrics.AUC()])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
efficientnet-b3 (Model)      (None, 9, 8, 1536)        10783528  
_________________________________________________________________
global_average_pooling2d (Gl (None, 1536)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 1537      
Total params: 10,785,065
Trainable params: 10,697,769
Non-trainable params: 87,296
_________________________________________________________________
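
The numbers in the summary can be checked by hand. The sketch below reproduces the parameter counts from the values printed above; the factor-of-32 spatial reduction with sizes rounded up is an assumption about the backbone, chosen because it matches the reported `(None, 9, 8, 1536)` output shape.

```python
import math

# Values copied from the model summary above.
backbone_params = 10_783_528
feature_channels = 1536
trainable, non_trainable = 10_697_769, 87_296

# Dense(1) on 1536 pooled features: one weight per feature plus one bias.
dense_params = feature_channels * 1 + 1
total_params = backbone_params + dense_params

# Assumption: the backbone shrinks each spatial dimension by a factor
# of 32, rounding up when the input size is not a multiple of 32.
height, width = 273, 256
feature_map = (math.ceil(height / 32), math.ceil(width / 32))

print(dense_params)               # 1537
print(total_params)               # 10785065
print(trainable + non_trainable)  # 10785065
print(feature_map)                # (9, 8)
```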


### Model Training

In [10]:
train = DataGenerator(X_train, y_train, batch_size=batch_size)
val = DataGenerator(X_val, y_val, batch_size=batch_size)
test = DataGenerator(X_test, batch_size=batch_size)

model.fit(train, validation_data=val, epochs=n_epoch)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fa19c1db340>

### Save model

In [11]:
model.save(os.path.join("models", "effnet_cadences_E2"))

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: models/effnet_cadences_E2/assets


### Test model

In [12]:
preds_test = model.predict(test).flatten()

In [18]:
np.round(preds_test[:4]).astype(np.uint8)

array([0, 0, 0, 0], dtype=uint8)

In [16]:
y_test[:4]

array([0, 0, 0, 0])

In [19]:
acc = accuracy_score(y_test, np.round(preds_test).astype(np.int32))
acc

0.9190833333333334
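
Accuracy is hard to interpret here because only about 10% of the samples are positive, so a model that always predicts 0 already scores about 0.90. The sketch below shows that baseline on synthetic labels with the same class ratio, together with a small NumPy implementation of ROC AUC, the metric passed to `model.compile`. The helper `pairwise_auc` and the synthetic labels are hypothetical and not part of the original notebook.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Memory grows with n_pos * n_neg,
    so this is only meant for small illustrations.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic labels with the same 10% positive ratio as the test split.
y_fake = np.array([1] * 1200 + [0] * 10800)

# A "model" that always predicts the negative class.
always_zero = np.zeros_like(y_fake)
baseline_accuracy = float((always_zero == y_fake).mean())
baseline_auc = float(pairwise_auc(y_fake, always_zero))

print(baseline_accuracy)  # 0.9
print(baseline_auc)       # 0.5
```

Against this baseline, the test accuracy of about 0.919 is a modest gain over 0.90, so a ranking metric such as AUC computed on `preds_test` before rounding would likely give a more informative picture.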

### Continue training

In [20]:
model.fit(train, validation_data=val, epochs=n_epoch)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fa04c06ab80>

In [21]:
preds_test = model.predict(test).flatten()
acc = accuracy_score(y_test, np.round(preds_test).astype(np.int32))
acc

0.9205833333333333

In [22]:
model.save(os.path.join("models", "effnet_cadences_E4"))

INFO:tensorflow:Assets written to: models/effnet_cadences_E4/assets


In [None]:
for epoch in range(6, 21, 2):
    model.fit(train, validation_data=val, epochs=n_epoch)
    preds_test = model.predict(test).flatten()
    acc = accuracy_score(y_test, np.round(preds_test).astype(np.int32))
    print(f"TEST ACCURACY AFTER EPOCH {epoch} : {acc:.6f}")
    model.save(os.path.join("models", f"effnet_cadences_E{epoch}"))

Epoch 1/2
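
The loop labels each checkpoint with the cumulative number of epochs trained. A small sketch of that bookkeeping, assuming the 4 epochs already completed in the cells above:

```python
n_epoch = 2      # epochs per call to model.fit, as set in the hyperparameters
epochs_done = 4  # two earlier fit calls of 2 epochs each

checkpoint_epochs = list(range(6, 21, 2))
for label in checkpoint_epochs:
    epochs_done += n_epoch
    # Each label equals the cumulative number of epochs trained so far.
    assert epochs_done == label

print(checkpoint_epochs)  # [6, 8, 10, 12, 14, 16, 18, 20]
print(epochs_done)        # 20
```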