# Training a Neural Net on the PLAsTiCC Dataset

Based on [meaninglesslives' public kernel](https://www.kaggle.com/meaninglesslives/simple-neural-net-for-time-series-classification).

See data definitions [here](https://www.kaggle.com/c/PLAsTiCC-2018/data) and detailed information on the study [here](./data/data_note.pdf).

**Table of Contents:**

* [Import Libraries](#Import-Libraries)
* [Load and Transform Data](#Load-and-Transform-Data)
* [Merge in Metadata](#Merge-in-Metadata)
* [Scale the Input](#Scale-the-Input)
* [Pickle the Training Data](#Pickle-the-Training-Data)
* [Set up Training Losses](#Set-up-Training-Losses)
* [Define Keras Model](#Define-Keras-Model)
* [Train the Model](#Train-the-Model)
* [Examine the Confusion Matrix](#Examine-the-Confusion-Matrix)
* [Predictions on the Test Set](#Predictions-on-the-Test-Set)

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import gc # Garbage collection

## Load and Transform Data

In [2]:
gc.enable()

train = pd.read_csv('./data/Space_Data/training_set.csv', engine='python')

FileNotFoundError: [Errno 2] No such file or directory: './data/Space_Data/training_set.csv'

First, compute the squared ratio of flux to its measurement error (a squared signal-to-noise ratio), and the flux multiplied by that squared ratio:

In [None]:
train['flux_ratio_sq'] = np.power(train['flux'] / train['flux_err'], 2.0)
train['flux_by_flux_ratio_sq'] = train['flux'] * train['flux_ratio_sq']

Next, create per-object features from standard aggregate statistics (note that pandas can [apply multiple functions at once](http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once)):

In [None]:
aggs = {
    'mjd': ['min', 'max', 'size'],
    'passband': ['min', 'max', 'mean', 'median', 'std'],
    'flux': ['min', 'max', 'mean', 'median', 'std','skew'],
    'flux_err': ['min', 'max', 'mean', 'median', 'std','skew'],
    'detected': ['mean'],
    'flux_ratio_sq':['sum','skew'],
    'flux_by_flux_ratio_sq':['sum','skew'],
}
agg_train = train.groupby('object_id').agg(aggs)

In [None]:
agg_train.head()

Flatten the two-level column names into a single level:

In [None]:
new_columns = [
    k + '_' + agg for k in aggs.keys() for agg in aggs[k]
]
agg_train.columns = new_columns
agg_train.head()
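The flattening step assumes that the order of the `aggs` dict matches the order of the aggregated columns. As a minimal sketch on a toy frame (the `toy`, `toy_aggs`, and `toy_agg` names and all values are made up for illustration), the names can also be derived from the two-level column index itself:

```python
import pandas as pd

# Toy light-curve table; values are made up for illustration.
toy = pd.DataFrame({
    'object_id': [1, 1, 2, 2],
    'flux': [1.0, 3.0, 10.0, 20.0],
    'detected': [0, 1, 1, 1],
})

toy_aggs = {'flux': ['min', 'max'], 'detected': ['mean']}
toy_agg = toy.groupby('object_id').agg(toy_aggs)

# Each column is a (source column, statistic) tuple; join the two levels.
toy_agg.columns = ['_'.join(col) for col in toy_agg.columns]
print(toy_agg)
```

Joining the tuples keeps each name tied to the column it describes, which may be a little more robust if the aggregation dict is edited later.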

Create new features that may be useful and clean up those that are no longer needed:

In [None]:
agg_train['mjd_diff'] = agg_train['mjd_max'] - agg_train['mjd_min']
agg_train['flux_diff'] = agg_train['flux_max'] - agg_train['flux_min']
agg_train['flux_dif2'] = (agg_train['flux_max'] - agg_train['flux_min']) / agg_train['flux_mean']
agg_train['flux_w_mean'] = agg_train['flux_by_flux_ratio_sq_sum'] / agg_train['flux_ratio_sq_sum']
agg_train['flux_dif3'] = (agg_train['flux_max'] - agg_train['flux_min']) / agg_train['flux_w_mean']
del agg_train['mjd_max'], agg_train['mjd_min']
agg_train.head()
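To see what `flux_w_mean` measures, here is a small sketch with two made-up observations of one object (the `obs` frame and its values are illustrative assumptions). Each flux value is weighted by its squared flux-to-error ratio, so the high signal-to-noise measurement dominates:

```python
import numpy as np
import pandas as pd

# Two made-up measurements of one object: a precise one and a noisy one.
obs = pd.DataFrame({'flux': [10.0, 20.0], 'flux_err': [1.0, 10.0]})

# Same definition as flux_ratio_sq: (flux / flux_err) squared.
weight = np.power(obs['flux'] / obs['flux_err'], 2.0)

# Same definition as flux_w_mean: sum(flux * weight) / sum(weight).
flux_w_mean = (obs['flux'] * weight).sum() / weight.sum()
flux_mean = obs['flux'].mean()

print(f"plain mean: {flux_mean:.2f}, weighted mean: {flux_w_mean:.2f}")
```

Here the weights are 100 and 4, so the weighted mean (about 10.38) stays close to the precise measurement while the plain mean is 15.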

Clean up our memory:

In [None]:
del train
gc.collect()

## Merge in Metadata 

First, load it:

In [None]:
meta_train = pd.read_csv('./data/Space_Data/training_set_metadata.csv')
meta_train.head()

Next, merge it:

In [None]:
full_train = agg_train.reset_index().merge(
    right=meta_train,
    how='outer',
    on='object_id'
)
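An outer merge keeps rows that appear on only one side, which then show up as NaNs. A minimal sketch of how one might check for such rows, using toy stand-ins for the two frames (the `left` and `right` names and values are made up) and the `indicator` and `validate` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for agg_train.reset_index() and meta_train.
left = pd.DataFrame({'object_id': [1, 2, 3], 'flux_mean': [0.5, 1.5, 2.5]})
right = pd.DataFrame({'object_id': [2, 3, 4], 'target': [90, 42, 65]})

merged = left.merge(
    right,
    how='outer',
    on='object_id',
    indicator=True,            # adds a '_merge' column
    validate='one_to_one',     # raises if object_id is duplicated on either side
)

# '_merge' is 'both' for matched rows, 'left_only' or 'right_only' otherwise.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['object_id', '_merge']])
```

If `unmatched` is empty on the real data, the outer merge behaves the same as an inner merge.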

Split the target labels out into their own vector, and examine the unique classes:

In [None]:
if 'target' in full_train:
    y = full_train['target']
    del full_train['target']
classes = sorted(y.unique())

# Taken from Giba's topic : https://www.kaggle.com/titericz
# https://www.kaggle.com/c/PLAsTiCC-2018/discussion/67194
# with Kyle Boone's post https://www.kaggle.com/kyleboone
class_weight = {
    c: 1 for c in classes
}
for c in [64, 15]:
    class_weight[c] = 2

print('Unique classes : ', classes)
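A per-class weight dict like this can be passed to scikit-learn classifiers that accept a `class_weight` argument. The sketch below is only an illustration on random, made-up data (`X_toy`, `y_toy`, `toy_weight`, and `rf_toy` are hypothetical names), not a tuned setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))          # made-up features
y_toy = np.repeat([15, 64, 90], 20)       # made-up labels, 20 per class

# Weight 1 for every class, then 2 for classes 64 and 15, as above.
toy_weight = {int(c): 1 for c in np.unique(y_toy)}
for c in [64, 15]:
    toy_weight[c] = 2

rf_toy = RandomForestClassifier(
    n_estimators=10, class_weight=toy_weight, random_state=0
)
rf_toy.fit(X_toy, y_toy)
print(rf_toy.classes_, toy_weight)
```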

Split the object IDs out into their own DataFrame, since they are not predictive (double brackets return a DataFrame, whereas single brackets would return a Series). Also drop the columns that are not used for prediction.

In [None]:
if 'object_id' in full_train:
    oof_df = full_train[['object_id']]
    del full_train['object_id'], full_train['distmod'], full_train['hostgal_specz']
    del full_train['ra'], full_train['decl'], full_train['gal_l'],full_train['gal_b'],full_train['ddf']

Fill NA values with the means of populated data (but don't delete `train_mean` here since we'll need to use it to transform the test data!):

In [None]:
train_mean = full_train.mean(axis=0)
full_train.fillna(train_mean, inplace=True)
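The reason for keeping `train_mean` can be sketched with two toy frames (the names and values below are made up): the test data is filled with the means computed on the training data, not with its own means.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({'a': [np.nan, 5.0], 'b': [np.nan, np.nan]})

toy_mean = toy_train.mean(axis=0)        # NaNs are skipped by default
toy_train = toy_train.fillna(toy_mean)
toy_test = toy_test.fillna(toy_mean)     # reuse training means, not test means

print(toy_test)
```

This also handles a test column that is entirely missing, such as `b` here, which has no mean of its own.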

In [None]:
full_train.head()

## Decision Tree

In [None]:
from sklearn import tree
target = y  # labels split out of full_train above, so rows stay aligned with the features
target.head(15)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(full_train, target, random_state=42)
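With imbalanced classes, a plain random split can leave rare classes under-represented in the test set. A minimal sketch of a stratified split on made-up data (`X_toy` and `y_toy` are illustrative), using the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)    # imbalanced, made-up labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Class counts in each split keep the 80/20 ratio of the full data.
print(np.bincount(y_tr), np.bincount(y_te))
```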


In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)


In [None]:
clf.score(X_train, y_train)

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)
rf = rf.fit(X_train, y_train)
rf.score(X_train, y_train)

In [None]:
rf.score(X_test, y_test)

## Support Vector Machine

In [None]:
from sklearn.svm import SVC 
# using X_train, X_test, y_train, y_test
svm_space = SVC(kernel='rbf')
svm_space.fit(X_train, y_train)
y_predict = svm_space.predict(X_test)

In [None]:
accuracy = svm_space.score(X_test, y_test)
print(accuracy)

In [None]:
svm_space.score(X_train, y_train)
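RBF-kernel SVMs are sensitive to feature scale, because the kernel depends on distances between samples. As a sketch under that assumption, the snippet below wraps the classifier in a pipeline with `StandardScaler`; the data is synthetic and the names (`X_toy`, `y_toy`, `scaled_svm`) are illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Made-up data: the informative feature is small, the noisy one is huge.
informative = rng.normal(size=200)
noise = rng.normal(scale=1e4, size=200)
X_toy = np.column_stack([informative, noise])
y_toy = (informative > 0).astype(int)

# The scaler is fitted on the training rows only, then applied at predict time.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scaled_svm.fit(X_toy[:150], y_toy[:150])
toy_score = scaled_svm.score(X_toy[150:], y_toy[150:])
print(toy_score)
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training.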

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

In [None]:
classifier.fit(X_train, y_train)

In [None]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

In [None]:
predictions = classifier.predict(X_test)
predictions_df = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
predictions_df.head()


In [None]:
predictions_df['Prediction'].unique()

In [None]:
len(predictions_df[predictions_df['Prediction'] == 90])

In [None]:
len(predictions_df)

This result suggests that either the training data is poorly conditioned (the features have not been scaled, for example) or logistic regression is a poor fit for this problem, since the model assigned the objects to class 90.
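One way to judge a model that collapses onto a single class is to compare it with a majority-class baseline. A minimal sketch with made-up labels (`X_toy`, `y_toy`, and `baseline` are illustrative names), using scikit-learn's `DummyClassifier`:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up labels in which class 90 dominates.
y_toy = np.array([90] * 7 + [42] * 2 + [15])
X_toy = np.zeros((len(y_toy), 1))         # features are ignored by the baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)
baseline_score = baseline.score(X_toy, y_toy)

# Every prediction is the most frequent class; accuracy equals its share.
print(pd.Series(baseline_pred).value_counts())
print(baseline_score)
```

A classifier whose accuracy is no better than this baseline has learned little beyond the class frequencies.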

## K Nearest Neighbor

In [None]:
len(X_train)

In [None]:
# KNN 
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3) 
knn.fit(X_train, y_train)  
Y_pred = knn.predict(X_test)  
acc_knn = knn.score(X_train, y_train)
print(acc_knn)

In [None]:
knn.score(X_test, y_test)


## Neural Networks with Scikit-Learn

In [None]:
from sklearn.neural_network import MLPClassifier 

neural_network = MLPClassifier(hidden_layer_sizes=(64,16,8), solver="adam", random_state=1)
neural_network.fit(X_train, y_train)

print(f"Training Data Score: {neural_network.score(X_train, y_train)}")
print(f"Testing Data Score: {neural_network.score(X_test, y_test)}")

## Elastic Net

In [None]:
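from sklearn.linear_model import ElasticNet
# Note: ElasticNet is a regressor, so it treats the class IDs as continuous values.
# The `normalize` argument was removed in scikit-learn 1.2; its old default was False.
model = ElasticNet(alpha=1, l1_ratio=0.5)
model.fit(X_train, y_train)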

In [None]:
# For a regressor, score() returns the R^2 coefficient, not classification accuracy
print(f"Training Data R^2: {model.score(X_train, y_train)}")
print(f"Testing Data R^2: {model.score(X_test, y_test)}")
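`ElasticNet` is a regression model, so its `score` method returns the coefficient of determination (R²) and is not comparable to the accuracy figures of the classifiers. A small sketch on random, made-up data (`X_toy`, `y_toy`, and `reg` are illustrative names) that checks this against `r2_score`:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = rng.choice([15, 42, 90], size=100)   # made-up class IDs

reg = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_toy, y_toy)
pred = reg.predict(X_toy)

r2 = r2_score(y_toy, pred)
# Continuous predictions rarely land exactly on an integer class ID.
acc = np.mean(np.rint(pred) == y_toy)
print(f"R^2: {r2:.3f}, accuracy after rounding: {acc:.3f}")
```

Because the class IDs are arbitrary labels rather than quantities, a regression fit to them is hard to interpret either way.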


## K-Means

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=15)

kmeans.fit(X_train)

In [None]:
predicted_clusters = kmeans.predict(X_train)

In [None]:
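# KMeans.score ignores the labels and returns the negative inertia, not an accuracy
print(f"Training Data Score (negative inertia): {kmeans.score(X_train)}")
print(f"Testing Data Score (negative inertia): {kmeans.score(X_test)}")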
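`KMeans.score` returns the negative of the K-means objective (inertia) and ignores any labels passed to it, so it does not say how well the clusters match the classes. One possible way to compare clusters with known labels is the adjusted Rand index; the sketch below uses two made-up, well-separated blobs (`X_toy`, `y_toy`, and `km` are illustrative names):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Two well-separated, made-up blobs with known labels.
X_toy = np.vstack([
    rng.normal(0, 0.1, size=(20, 2)),
    rng.normal(10, 0.1, size=(20, 2)),
])
y_toy = np.array([0] * 20 + [1] * 20)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

neg_inertia = km.score(X_toy)                        # labels are not used here
ari = adjusted_rand_score(y_toy, km.predict(X_toy))  # 1.0 means a perfect match
print(neg_inertia, ari)
```

The adjusted Rand index does not depend on how the cluster IDs are numbered, which matters because K-means cluster numbers have no relation to the class IDs.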


## Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

naive_model = GaussianNB()

naive_model.fit(X_train, y_train)

print(f"Training Data Score: {naive_model.score(X_train, y_train)}")
print(f"Testing Data Score: {naive_model.score(X_test, y_test)}")
