# Welcome to the SETI Institute Code Challenge!

This first tutorial explains what the data are and where to get them.

# Update 21 June 2017

We learned a lot at the hackathon on June 10-11 and decided to regenerate the primary data set. The result is the `v3` primary data set. Compared to `v2`, the changes are: the noise background is Gaussian white noise instead of noise from the Sun; the signal amplitudes are higher and their characteristics should make them more distinguishable; and the full set contains only 140k files (20k per signal type), compared with 350k previously (50k per signal type).

The `basic` data set remains unchanged from before.

# Introduction

For the Code Challenge, you will be using the **"primary" data set**, as we've called it. The primary data set is

  * a labeled data set of 140,000 simulated signals
  * made up of 7 different labels, or "signal classifications"
  * about 51 GB of data in total
    
This data set should be used to train your models. 

**You do not need to use all the data to train your models** if you do not want, or do not need, to consume the entire set. 

There are also a **`small` and a `medium` sized subset** of these primary data files. 


## Simple Data Format

Each data file has a simple format: 

  * file name = `<UUID>.dat`
  * a JSON header in the first line that contains:
    * UUID
    * signal_classification (label)
  * followed by a stream of complex-valued time-series data.

The `ibmseti` Python package is available to assist in reading this data and performing some basic operations for you. 
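
To make the file layout concrete, here is a minimal sketch (written for Python 3) of separating the JSON header from the payload without `ibmseti`. It assumes only what is stated above: the header is a single JSON object ending at the first newline. The example bytes, the `uuid` key spelling, and the `noise` label are made up for illustration, and the sketch does not attempt to decode the complex-valued samples; use `ibmseti` for that.

```python
import json


def split_header(raw):
    """Split raw file bytes into (header dict, payload bytes).

    Assumes the header is one JSON object terminated by the first newline.
    """
    header_line, _, payload = raw.partition(b"\n")
    header = json.loads(header_line.decode("utf-8"))
    return header, payload


# Synthetic stand-in for the contents of a <UUID>.dat file (not real challenge data).
example = b'{"uuid": "0000-example", "signal_classification": "noise"}\n\x01\x02\x03\x04'
header, payload = split_header(example)
print(header.get("signal_classification"), len(payload))
```

Using `header.get(...)` rather than indexing means the same function also works on files whose header carries no label.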

## Basic Warmup Data Set

There is also a second, simple and clean data set that you may use for warmup, which we call the **"basic" data set**. This basic set should be used as a sanity check and for very early-stage prototyping. We recommend that everybody starts with this. 

  * Only 4 different signal classifications
  * 1000 simulation files for each class: 4000 files total
  * Available as single zip file
  * ~1 GB in total
       
### Basic Set versus Primary Set

> The difference between the `basic` and `primary` data sets is that the signals simulated in the `basic` set have, on average, a much higher signal-to-noise ratio (they are larger-amplitude signals). They also have other characteristics that make the different signal classes very distinguishable. **You should be able to get very high signal classification accuracy with the basic data set.**  The signals in the primary data set have smaller amplitudes and can look more similar to each other, which makes accurate classification more difficult. There are also only 4 classes in the basic data set, versus 7 classes in the primary set. 


## Primary Data Sets

### Primary Small

The `primary small` is a subset of the full primary data set.  Use for early-stage prototyping.

  * All 7 signal classifications
  * 1000 simulations / class (7 classes = 7000 files)
  * Available as single zip file
  * ~2 GB in total

### Primary Medium

The `primary medium` is a subset of the full primary data set.  Use for prototyping and model building.

  * All 7 signal classifications
  * 5000 simulations / class (7 classes = 35000 files)
  * Large enough for relatively robust model construction
  * Available in 5 separate zip files
  * ~10 GB in total
 
### Primary Full

The `primary full` is the entire primary data set.  Use these data only if you want a very large training data set.

  * All 7 signal classifications
  * 20000 simulations / class (7 classes = 140000 files)
  * Only available in 140k individual files
    * one must read through the index file and download files individually, which will take some time from outside of IBM Cloud systems
  * ~50 GB in total
  

## Index Files

For every data set there is an **index** file in CSV format. Each row holds the UUID and signal_classification (label) of one simulation file in the data set. You can use these index files in a few different ways, from keeping track of your downloads to facilitating parallelization of your analysis on Spark.
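
As one illustration of using an index file, the sketch below (Python 3) counts files per class and draws a small, reproducible per-class sample of UUIDs. It assumes the layout described above, with a header row followed by rows whose first column is the UUID and second column is the label. The header names, UUIDs, and labels in the example text are made up, and `read_index` and `sample_per_class` are hypothetical helper names.

```python
import csv
import io
import random
from collections import Counter, defaultdict


def read_index(text):
    """Return a list of (uuid, label) pairs from index-file text, skipping the header row."""
    rows = list(csv.reader(io.StringIO(text)))
    return [(row[0], row[1]) for row in rows[1:] if len(row) >= 2]


def sample_per_class(pairs, n_per_class, seed=0):
    """Pick up to n_per_class UUIDs for each label, reproducibly for a given seed."""
    by_label = defaultdict(list)
    for uuid, label in pairs:
        by_label[label].append(uuid)
    rng = random.Random(seed)
    return {label: rng.sample(uuids, min(n_per_class, len(uuids)))
            for label, uuids in by_label.items()}


# Synthetic index text for illustration only.
index_text = "UUID,SIGNAL_CLASSIFICATION\na1,noise\na2,noise\nb1,narrowband\n"
pairs = read_index(index_text)
print(Counter(label for _, label in pairs))
print(sample_per_class(pairs, 1))
```

Sampling from the index first lets you decide which files to download before transferring any of the data itself.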



## Direct Data URLs if you are working from outside of IBM Data Science Experience

### Basic4
[Data (1.1 GB)](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_basic_v2/basic4.zip)

[Index File](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_basic_v2_26may_2017.csv)


### Primary Small

[Data (1.9 GB)](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3_zipped/primary_small_v3.zip)

[Index File](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_primary_v3_small_21june_2017.csv)

### Primary Medium

[Data Zip File 1 (1.9 GB)](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3_zipped/primary_medium_v3_1.zip)

[Data Zip File 2 (1.9 GB)](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3_zipped/primary_medium_v3_2.zip)

[Data Zip File 3 (1.9 GB)](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3_zipped/primary_medium_v3_3.zip)

[Data Zip File 4 (1.9 GB)](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3_zipped/primary_medium_v3_4.zip)

[Data Zip File 5 (1.9 GB)](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3_zipped/primary_medium_v3_5.zip)

[Index File](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_primary_v3_medium_21june_2017.csv)


It's probably easiest to download these zip files, unzip them separately, then move their contents into a single folder. 
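
The unzip-and-merge step can also be scripted. The following is a rough sketch (Python 3) that extracts every file from several archives straight into one folder, dropping any subfolders inside the archives. The internal layout of the zip files is not described here, so the flattening is an assumption, and `extract_flat` is a hypothetical helper name. Because each file is named by its UUID, name collisions are not expected.

```python
import os
import shutil
import zipfile


def extract_flat(zip_paths, dest_dir):
    """Extract every file from each archive directly into dest_dir; return the file count."""
    os.makedirs(dest_dir, exist_ok=True)
    count = 0
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.infolist():
                name = os.path.basename(member.filename)
                if not name:  # directory entries end in '/', so their basename is empty
                    continue
                target = os.path.join(dest_dir, name)
                with archive.open(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                count += 1
    return count


# Example call, assuming the five zip files were downloaded to the current folder:
# zips = ['primary_medium_v3_{}.zip'.format(i) for i in range(1, 6)]
# extract_flat(zips, 'primary_medium')
```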

### Primary Full

There are two ways to download the full data set. Scroll below to look for a programmatic way. Alternatively, a fellow participant has provided us [the "ninja" way to very quickly download the full data set with a single command-line call.](https://github.com/setiQuest/ML4SETI/blob/master/tutorials/Step_1b_Full_Data_Download.md) 

# Test Data Sets

Once you've trained your model, done all of your cross-validation testing, and are ready to submit an entry to the contest, you'll need to download the test data set and score the test set data with your model.  


The test data files have nearly the same format as the training files. The only difference is that the JSON header in each file does not contain the signal class. You can use the `ibmseti` Python package to read each file, just as you would the training data. See [Step_2_reading_SETI_code_challenge_data.ipynb](https://github.com/setiQuest/ML4SETI/blob/master/tutorials/Step_2_reading_SETI_code_challenge_data.ipynb) for examples. 

### Note:
### July 1 - July 21:  Only the "Preview" test set is available.
###             July 21 - July 31: The final test set is now available. 

<br>

## Preview Test Set
The `primary_testset_preview_v3` data set contains 2414 test simulation files. Each data file is the same as the above training data except the JSON header does NOT contain the 'signal_classification' key. 

  * All 7 classes
  * Roughly 340 simulations per class  
  * JSON header with UUID only
  * Available as single zip file
  * 665 MB in total


### Direct Download Link

[Preview Test Set Zip File](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3_zipped/primary_testset_preview_v3.zip)

[Preview Test Set Index File](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_primary_v3_testset_preview.csv)



## Final Test Set
The `primary_testset_final_v3` data set contains 2496 test simulation files. Each data file is the same as the above training data except the JSON header does NOT contain the 'signal_classification' key. 

  * All 7 classes
  * Roughly 350 simulations per class 
  * JSON header with UUID only
  * Available as single zip file
  * 687 MB in total



### Direct Download Link

[Final Test Set Zip File](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3_zipped/primary_testset_final_v3.zip)

[Final Test Set Index File](https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_primary_v3_testset_final.csv)





### Submitting Classification Results

See the [Judging Criteria](https://github.com/setiQuest/ML4SETI/blob/master/Judging_Criteria.ipynb) notebook for information on submitting your test-set classifications.

# Programmatically Accessing the Data

The data are stored in `containers` on IBM Object Storage, and you can access them with HTTP calls. Here we use system-level `curl`, but you could just as easily use the Python `requests` package. 

The URL for all data files is composed of

  `base_url/container/objectname`.
 
The `base_url` is:

In [None]:
#Keep only ONE of the two assignments below; as written, the second one overrides the first.

#If you are running this in IBM Apache Spark (via Data Science Experience):
base_url = 'https://dal05.objectstorage.service.networklayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#ELSE, if you are outside of IBM:
base_url = 'https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#NOTE: the 2nd base_url, for use outside of IBM, will be slower. :/

In [None]:
#Defining a local data folder to dump data

import os

mydatafolder = os.path.join(os.getcwd(), 'my_data_folder')  # os.getcwd() also works where $PWD is not set
if not os.path.exists(mydatafolder):
    os.makedirs(mydatafolder)

## Basic Data Set

We'll start with the basic data set.  Because the basic data set is small, we've created a `.zip` file of the full data set that you can download directly.  

In [None]:
import os

In [None]:
basic_container = 'simsignals_basic_v2'
basic4_zip_file = 'basic4.zip'

In [None]:
os.system('curl {}/{}/{} > {}'.format(base_url, basic_container, basic4_zip_file, mydatafolder + '/' + basic4_zip_file))

In [None]:
!ls -al my_data_folder/basic4.zip
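
The same download can be sketched with the `requests` package instead of `curl`. This is an illustrative version for Python 3, not the method used elsewhere in this tutorial: `object_url` and `download` are hypothetical helper names, the 1 MB chunk size is an arbitrary choice, and the timeout values are simply those used in the `requests` calls later in this notebook.

```python
def object_url(base_url, container, objectname):
    """Compose base_url/container/objectname."""
    return "{}/{}/{}".format(base_url.rstrip("/"), container, objectname)


def download(url, dest_path, chunk_size=1 << 20):
    """Stream url to dest_path in chunks so a large zip is never held fully in memory."""
    import requests  # third-party; imported here so object_url works without it installed

    with requests.get(url, stream=True, timeout=(9.0, 21.0)) as r:
        r.raise_for_status()  # fail loudly instead of saving an error page
        with open(dest_path, "wb") as fout:
            for chunk in r.iter_content(chunk_size=chunk_size):
                fout.write(chunk)
    return dest_path


url = object_url(
    "https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b",
    "simsignals_basic_v2",
    "basic4.zip",
)
print(url)
# download(url, "my_data_folder/basic4.zip")  # uncomment to actually fetch the file
```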

## Primary Data Set


### Primary Small

The `primary_small` subset can be found in a zip file in
* container = 'simsignals_v3_zipped'
* objectname = 'primary_small_v3.zip'

In [None]:
filename = 'primary_small_v3.zip'
primary_small_url = '{}/simsignals_v3_zipped/{}'.format(base_url, filename)
os.system('curl {} > {}'.format(primary_small_url, mydatafolder +'/'+filename))

###### Primary Small Index File

A CSV file containing the UUID and signal classification for each file in the `primary_small` subset.

In [None]:
filename = 'public_list_primary_v3_small_21june_2017.csv'
primary_small_csv_url = '{}/simsignals_files/{}'.format(base_url, filename)
os.system('curl {} > {}'.format(primary_small_csv_url, mydatafolder +'/'+filename))

### Primary Medium

Similarly, the `primary_medium` subset can be found in a handful of zip files

* container = 'simsignals_v3_zipped'
* objectname = 'primary_medium_v3_N.zip'
* for N = 1, 2, 3, 4, 5

In [None]:
med_N = '{}/simsignals_v3_zipped/primary_medium_v3_{}.zip'

for i in range(1,6):
    med_url = med_N.format(base_url, i)
    output_file = mydatafolder + '/primary_medium_v3_{}.zip'.format(i)
    print('GETing {}'.format(output_file))
    os.system('curl {} > {}'.format(med_url, output_file ))

###### Primary Medium Index File

Here too, there is a CSV file containing the UUID and signal classification for each file in the `primary_medium` subset.

In [None]:
filename = 'public_list_primary_v3_medium_21june_2017.csv'
med_csv_url = '{}/simsignals_files/{}'.format(base_url, filename)
os.system('curl {} > {}'.format(med_csv_url, mydatafolder +'/'+filename))

### Primary Full set

Because the full set is so incredibly large, we only have these 140,000 files available individually on object storage. 

A fellow participant has provided us [a nice way to very quickly download the full data set with a single command-line call.](https://github.com/setiQuest/ML4SETI/blob/master/tutorials/Step_1b_Full_Data_Download.md) 

Or you can follow the instructions below.

The `primary_full` list can be found here: 

In [None]:
filename = 'public_list_primary_v3_full_21june_2017.csv'
prim_full = '{}/simsignals_files/{}'.format(base_url, filename)
os.system('curl {} > {}'.format(prim_full, mydatafolder +'/'+filename))

One can download this list and begin to pull down files individually if desired. Be warned, however: **this will take approximately a billion years if you are not running on IBM Apache Spark**. IBM Apache Spark and Object Storage sit in the same data center and share a fast network connection. 

The data are found in 

`base_url/simsignals_v3/<uuid>.dat`

For example:

https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v3/03fbb014-344a-42bc-a3e4-f755557a3628.dat

**We are working to make the primary full data set more easily available, as the current setup is less than ideal. You will be notified if that happens. The data will be directly available to participants of the hackathon, however.**

If you wish to programmatically begin to download the full data set you may use the following code. 

In [None]:
import requests
import copy

In [None]:
file_list_container = 'simsignals_files'
file_list = 'public_list_primary_v3_full_21june_2017.csv'
primary_data_container = 'simsignals_v3'

In [None]:
r = requests.get('{}/{}/{}'.format(base_url, file_list_container, file_list), timeout=(9.0, 21.0))
filecontents = copy.copy(r.content)

In [None]:
#decode the bytes returned by requests so this works on both Python 2 and Python 3
full_primary_files = [line.split(',') for line in filecontents.decode('utf-8').split('\n')]
full_primary_files = full_primary_files[1:-1] #strip the header and empty last element
full_primary_files = [x[0] + ".dat" for x in full_primary_files]  #now a list of file names (<uuid>.dat)

In [None]:
#save your data into a local subfolder
save_to_folder = mydatafolder + '/primary_data_set'
if not os.path.exists(save_to_folder):
    os.mkdir(save_to_folder)

In [None]:
total = len(full_primary_files)
for count, row in enumerate(full_primary_files):
    r = requests.get('{}/{}/{}'.format(base_url, primary_data_container, row), timeout=(9.0, 21.0))
    r.raise_for_status()  #stop on an HTTP error rather than saving an error page as data

    if count % 100 == 0:
        print('done {} out of {}'.format(count, total))

    #r.content is bytes, so the file must be opened in binary mode
    with open('{}/{}'.format(save_to_folder, row), 'wb') as fout:
        fout.write(r.content)
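
Since a download of this size is likely to be interrupted at some point, it can help to make it resumable. The following is a hedged sketch (Python 3) rather than part of the original workflow: it skips files that already exist locally, fetches the rest on a small thread pool, and returns the names that failed so they can be retried. `download_missing` is a hypothetical helper, the worker count is an arbitrary choice, and `fetch` is any function you supply that maps a file name to its bytes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def download_missing(names, fetch, dest_dir, workers=8):
    """Fetch each name not already present in dest_dir; return the names that failed."""
    os.makedirs(dest_dir, exist_ok=True)
    todo = [n for n in names if not os.path.exists(os.path.join(dest_dir, n))]

    def fetch_one(name):
        try:
            data = fetch(name)
        except Exception:
            return name  # report the failure and keep going
        tmp_path = os.path.join(dest_dir, name + ".part")
        with open(tmp_path, "wb") as fout:
            fout.write(data)
        # Rename only after a complete write, so partial files never look finished.
        os.replace(tmp_path, os.path.join(dest_dir, name))
        return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch_one, todo))
    return [name for name in results if name is not None]


# Example wiring, using the names defined in the cells above:
# def fetch(name):
#     r = requests.get('{}/{}/{}'.format(base_url, primary_data_container, name), timeout=(9.0, 21.0))
#     r.raise_for_status()
#     return r.content
# failed = download_missing(full_primary_files, fetch, save_to_folder)
```

Calling the function again with the same arguments picks up where the previous run stopped, because completed files are skipped.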

### This is really a lot of data

This will be a difficult data set to consume and process if you are using free-tier levels of software from any Cloud provider. You will likely want a robust machine, or set of machines, with many threads and GPUs if you want to train models with such a large data set. 

For example, if you have access to an IBM Spark Enterprise cluster, because the network connection between IBM Spark and IBM Object Storage is so fast, we recommend that you **do NOT** download each file. Instead you could parallelize the index file and then retrieve and process each file on a worker node. 

In [None]:
## Using Spark -- can parallelize the job across your worker nodes
import ibmseti
def retrieve_and_process(row):
    try:
        r = requests.get('{}/{}/{}'.format(base_url, primary_data_container, row), timeout=(9.0, 21.0))
    except Exception as e:
        return (row, 'failed', [])
    
    aca = ibmseti.compamp.SimCompamp(r.content)
    spectrogram = aca.get_spectrogram() # or do something else
    features = my_feature_extractor(spectrogram) #example external function for reducing the spectrogram into a handful of features, perhaps
    
    signal_class = aca.header()['signal_classification']
        
    return (row, signal_class, features)

npartitions = 60  
rdd = sc.parallelize(full_primary_files, npartitions)

#Now ask Spark to run the job
process_results = rdd.map(retrieve_and_process).collect()
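
Because `retrieve_and_process` marks a failed request by returning `'failed'` in the label position, the collected results can be separated into successes and rows worth retrying. The sketch below shows one way to do that on plain Python lists. `split_results` is a hypothetical helper, and the example tuples are synthetic stand-ins for `process_results`.

```python
def split_results(results):
    """Separate (row, signal_class, features) tuples into successes and failed row names."""
    succeeded = [r for r in results if r[1] != 'failed']
    failed_rows = [r[0] for r in results if r[1] == 'failed']
    return succeeded, failed_rows


# Synthetic stand-in for `process_results`; the label and features are made up.
example_results = [
    ('a.dat', 'noise', [0.1, 0.2]),
    ('b.dat', 'failed', []),
]
succeeded, failed_rows = split_results(example_results)
print(len(succeeded), failed_rows)

# A second pass could then be run over just the failures, for example:
# retry_results = sc.parallelize(failed_rows, npartitions).map(retrieve_and_process).collect()
```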