#### The following steps will help you get started. To **RUN THIS CODE HERE**, you need to *fork* a new branch (see the blue button in the top-right corner), and then execute each cell by pressing *Shift+Enter* or by clicking the blue button on the left side of each cell.

* Let's import a few handy tools.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from json import JSONDecoder, JSONDecodeError  # for reading the JSON data files
import re  # for regular expressions
import os  # for os related operations

Input data files are available in the '../input/' directory.
Any results you write to the current directory will be saved here as output.

* We can list all files in this directory:

In [7]:
print(os.listdir("../input"))

FileNotFoundError: [Errno 2] No such file or directory: '../input'
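* The error above simply means that the directory does not exist relative to where the notebook is running. As a minimal sketch (the helper name `list_input_files` is hypothetical, not part of the original code), we can check for the directory before listing it:

```python
import os


def list_input_files(data_dir):
    """Return the sorted file names in data_dir, or an empty list if it is missing."""
    if not os.path.isdir(data_dir):
        print('Directory not found: {}'.format(data_dir))
        return []
    return sorted(os.listdir(data_dir))


# Adjust the path to wherever the data actually lives in your environment.
print(list_input_files('../input'))
```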

* To read the data in JSON format, we need a decoder such as the following:

In [2]:
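def decode_obj(line, pos=0, decoder=JSONDecoder()):
    """Yield each JSON object found in `line`, starting at index `pos`."""
    no_white_space_regex = re.compile(r'[^\s]')
    while True:
        # Skip whitespace: raw_decode does not accept leading whitespace.
        match = no_white_space_regex.search(line, pos)
        if not match:
            return
        pos = match.start()
        try:
            obj, pos = decoder.raw_decode(line, pos)
        except JSONDecodeError as err:
            print('Oops! something went wrong. Error: {}'.format(err))
            return  # stop here; otherwise 'obj' is undefined and 'pos' never advances
        yield obj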
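* To see what `raw_decode` returns, here is a small self-contained sketch. The sample line and its values are made up for illustration; only the key names `id`, `classNum`, and `values` are taken from the code in this notebook.

```python
from json import JSONDecoder

# A made-up line: one JSON object followed by trailing whitespace.
sample_line = '{"id": 1, "classNum": 0, "values": {"TOTUSJH": {"0": 1.5, "1": 2.5}}}  \n'

decoder = JSONDecoder()
# raw_decode parses one JSON value starting at the given index and returns
# the decoded object together with the index where parsing stopped.
obj, end = decoder.raw_decode(sample_line, 0)

print(obj['id'], obj['classNum'])
print(sample_line[end:].strip() == '')  # only whitespace is left after the object
```

Because `raw_decode` reports where it stopped, a generator can keep calling it from that position until nothing but whitespace remains.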

* As an example, let's implement a method that gets the last n values of the multivariate time series in each observation window.

In [3]:
def get_obj_with_last_n_val(line, n):
    obj = next(decode_obj(line))  # type:dict
    obj_id = obj['id']  # avoid shadowing the built-in id()
    class_label = obj['classNum']

    data = pd.DataFrame.from_dict(obj['values'])  # type:pd.DataFrame
    # JSON keys are strings, so convert the time-step index to integers
    data.set_index(data.index.astype(int), inplace=True)
    # Time steps are indexed 0..59; keep the last n of them
    last_n_indices = np.arange(0, 60)[-n:]
    data = data.loc[last_n_indices]

    return {'id': obj_id, 'classType': class_label, 'values': data}
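* The index handling above can be tried on a toy example. This is only a sketch: the parameter names `PARAM_A` and `PARAM_B` and the five time steps are invented, and the layout of `values` (parameter name mapped to a dictionary of time step to value) is an assumption based on how the code reads it.

```python
import numpy as np
import pandas as pd

# Hypothetical 'values' payload: two parameters, five time steps,
# with the time steps stored as string keys (as in JSON).
values = {
    'PARAM_A': {str(t): float(t) for t in range(5)},
    'PARAM_B': {str(t): float(10 * t) for t in range(5)},
}

data = pd.DataFrame.from_dict(values)  # columns: parameters, index: time steps
data.index = data.index.astype(int)    # '0'..'4' -> 0..4
data = data.sort_index()

n = 2
last_n = data.loc[np.arange(0, 5)[-n:]]  # selects time steps 3 and 4
print(last_n)
```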

* The above methods allow us to load the data as a pandas DataFrame, or even save it in CSV format. Let's define a new method that does this. Note that you can uncomment the line that stores the data as CSV if you want.

In [11]:

def convert_json_data_to_csv(data_dir: str, file_name: str):
    """
    Generates a dataframe by concatenating the last values of each
    multi-variate time series. This method is designed as an example
    to show how a json object can be converted into a csv file.
    :param data_dir: the path to the data directory.
    :param file_name: name of the file to be read, with the extension.
    :return: the generated dataframe.
    """
    fname = os.path.join(data_dir, file_name)

    all_df, labels, ids = [], [], []
    with open(fname, 'r') as infile: # Open the file for reading
        for line in infile:  # Each 'line' is one MVTS with its single label (0 or 1).
            obj = get_obj_with_last_n_val(line, 1)
            all_df.append(obj['values'])
            labels.append(obj['classType'])
            ids.append(obj['id'])

    df = pd.concat(all_df).reset_index(drop=True)
    df = df.assign(LABEL=pd.Series(labels))
    df = df.assign(ID=pd.Series(ids))
    # The default 0..N-1 row index is kept; the IDs are available in the 'ID' column.
    # Uncomment if you want to save this as CSV
    # df.to_csv(file_name + '_last_vals.csv', index=False)
    return df

* Now we are ready to load the data. We try loading 'fold3Training.json' as an example. This should result in a dataframe with 27006 rows and 27 columns (i.e., all 25 physical parameters, plus two additional columns: ID and LABEL).

In [14]:
path_to_data = "./input"
file_name = "fold3Training.json"

df = convert_json_data_to_csv(path_to_data, file_name)  # shape: 27006 X 27
print('df.shape = {}'.format(df.shape))
# print(list(df))
print(df)

df.shape = (27006, 27)
           TOTUSJH        TOTBSQ        TOTPOT       TOTUSJZ      ABSNJZH  \
0      2279.058608  4.176910e+10  6.722922e+23  4.151445e+13   298.753182   
1       324.136602  3.044442e+09  1.842963e+22  7.596014e+12    64.312903   
2        90.928971  6.418759e+08  5.420498e+21  1.975487e+12     0.886584   
3       173.008586  2.210899e+09  2.422310e+22  3.389141e+12    10.262131   
4        56.286406  3.814089e+08  2.659824e+21  1.210523e+12     8.744935   
5        19.279922  1.569373e+08  1.154881e+21  4.398508e+11     1.149915   
6       769.412439  1.176934e+10  1.999628e+23  1.795664e+13    88.269954   
7       183.011264  1.274385e+09  8.443752e+21  3.505530e+12     9.407660   
8        15.650825  1.194461e+08  8.034953e+20  3.348638e+11     1.961768   
9       343.409575  5.359117e+09  9.470372e+22  7.374356e+12    59.548721   
10     2940.076791  2.753561e+10  5.259057e+23  5.326476e+13   578.557954   
11     2956.525273  3.766238e+10  5.248393e+23  5.886

* There are many ways to deal with missing values. The simplest approach would be to drop all rows which contain any missing values.

In [15]:
df = df.dropna()  # shape: 26666 X 27
print('df.shape = {}'.format(df.shape))

df.shape = (26666, 27)
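* As one alternative to dropping rows, missing values can be filled in. The sketch below uses a made-up toy frame and fills each feature column with its median. This is just one of many possible strategies, not necessarily the best one for this data.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (all values are invented).
toy = pd.DataFrame({
    'PARAM_A': [1.0, np.nan, 3.0, 5.0],
    'PARAM_B': [10.0, 20.0, np.nan, 40.0],
    'LABEL': [0, 1, 0, 1],
})

# How many values are missing in each column?
print(toy.isna().sum())

# Fill each feature column with its own median; labels are left untouched.
feature_cols = ['PARAM_A', 'PARAM_B']
filled = toy.copy()
filled[feature_cols] = toy[feature_cols].fillna(toy[feature_cols].median())
print(filled)
```

Unlike `dropna()`, this keeps every row, at the cost of introducing values that were never observed.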


* To train a simple classifier, we first need training and validation sets. For simplicity, let's assign the first two-thirds of this fold to our training set, and use the rest as a validation set.

In [None]:
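# Note: this compares ID values against a row count, so it only yields a
# two-thirds split if the IDs are consecutive integers starting at 1.
t = (2/3) * df.shape[0]
df_train = df[df['ID'] <= t]  # shape: 18004 X 27 if no rows were dropped
df_val = df[df['ID'] > t]  # shape: 9002 X 27 if no rows were dropped
print('df_train.shape = {}'.format(df_train.shape))
print('df_val.shape = {}'.format(df_val.shape))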
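* If the IDs are not consecutive (for example, after rows have been dropped), a split by row position may be easier to reason about. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# Toy frame whose IDs have gaps (all values are invented).
toy = pd.DataFrame({
    'x': range(9),
    'ID': [11, 12, 15, 20, 21, 30, 31, 40, 41],
})

# Split by row position instead of by ID value.
cut = (2 * len(toy)) // 3
toy_train = toy.iloc[:cut]   # first two-thirds of the rows
toy_val = toy.iloc[cut:]     # remaining third

print('toy_train.shape = {}'.format(toy_train.shape))
print('toy_val.shape = {}'.format(toy_val.shape))
```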

* Finally, time to train a model. Of course, we should import some packages first.

In [None]:
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

**Note** that the training phase may take a few minutes.


In [20]:
dfn = df.to_numpy()
print(dfn.shape)
np.savez('test.npz', dfn)

(26666, 27)


Magnus trying stuff out

In [None]:
# Separate values and labels columns
df_train_data = df_train.iloc[:, :-2]  # all columns excluding 'ID' and 'LABEL'
df_train_labels = pd.DataFrame(df_train.LABEL)  # only 'LABEL' column

df_val_data = df_val.iloc[:, :-2]  # all columns excluding 'ID' and 'LABEL'
df_val_labels = pd.DataFrame(df_val.LABEL)  # only 'LABEL' column

# Train a simple SVM as an example
svm_c = 1000
svm_gamma = 0.01
clf = svm.SVC(gamma=svm_gamma, C=svm_c, max_iter=-1, verbose=1, shrinking=True, random_state=42)
clf.fit(df_train_data, np.ravel(df_train_labels))

* Our model is now ready for prediction. Let's see how well it performs. (We measure its performance using the F1 score.)

In [None]:
# Test the model against the validation set
pred_labels = clf.predict(df_val_data)

# Evaluate the predictions
scores = confusion_matrix(df_val_labels, pred_labels).ravel()
tn, fp, fn, tp = scores
print('TN:{}\tFP:{}\tFN:{}\tTP:{}'.format(tn, fp, fn, tp))
f1 = f1_score(df_val_labels, pred_labels, average='binary', labels=[0, 1])
print('f1-score = {}'.format(f1))
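* For reference, the binary F1 score is the harmonic mean of precision and recall, and it can be computed directly from the confusion-matrix counts. The helper `f1_from_counts` below is a hypothetical name used only for this sketch, and the counts are made up.

```python
def f1_from_counts(tp, fp, fn):
    """Compute the binary F1 score from confusion-matrix counts."""
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # convention used in this sketch when F1 is undefined
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration.
print('f1-score = {}'.format(f1_from_counts(tp=8, fp=2, fn=2)))
```

Note that true negatives do not appear in the formula, so a model that only predicts the negative class gets an F1 score of zero no matter how many negatives it gets right.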

Oops! It seems that the model is not good at all! Maybe the model needs tuning! Maybe the data needs more preprocessing! Or maybe the "last values", as we used in this example code, are not such good predictors of solar flares!
#### You can pick up from here. It's all in your hands now.
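* As one possible starting point, the sketch below standardizes the features before fitting the SVM. An RBF kernel is sensitive to feature scale, and the columns printed earlier differ by many orders of magnitude. The data here is synthetic and the hyperparameters are arbitrary, so this only illustrates the mechanics and says nothing about how the real data will behave.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(42)

# Synthetic stand-in data: two features on very different scales.
X = np.column_stack([
    rng.normal(0, 1, 200) * 1e3,
    rng.normal(0, 1, 200) * 1e22,
])
y = (X[:, 0] > 0).astype(int)  # the label depends on the first feature only

# Scale each feature to zero mean and unit variance, then fit the SVM.
model = make_pipeline(StandardScaler(), SVC(C=1.0, gamma='scale'))
model.fit(X[:150], y[:150])

accuracy = model.score(X[150:], y[150:])
print('accuracy on held-out toy data = {}'.format(accuracy))
```

Putting the scaler inside the pipeline means it is fitted on the training rows only and then reused unchanged on the validation rows.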