# Build PNG Files

### Basic Data Set
In this notebook, we'll take the *`basic`* data set, use the `ibmseti` Python package to convert each data file into a spectrogram, then save the spectrograms as `.png` files.

### Split into training / test (cross-validation) and zip
We'll also split the data set into a training set and a test set and create a handful of zip files for each class. This dovetails into the next tutorial, where we will train a custom Watson Visual Recognition classifier (using the zip files of PNGs) and measure its performance with the test (cross-validation) set.

### Update for `primary`

You may want to adapt this script to use the `primary` data set. 

In [1]:
from __future__ import division

import cStringIO
import glob
import json
import requests
import ibmseti
import os
import zipfile
import numpy as np
import matplotlib.pyplot as plt


In [2]:
#Making a local folder to put my data.

mydatafolder = os.environ['PWD'] + '/' + 'current_data'
if os.path.exists(mydatafolder) is False:
    os.makedirs(mydatafolder)
print mydatafolder

/home/fbarilla/SETI/ML4SETI/tutorials/current_data


In [3]:
#Keep only ONE of the two base_url assignments below (as written, the second overrides the first).

#If you are running this in IBM Apache Spark (via Data Science Experience):
base_url = 'https://dal05.objectstorage.service.networklayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#ELSE, if you are outside of IBM:
base_url = 'https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#NOTE: if you are outside of IBM, pulling down data will be slower. :/

In [None]:
## You don't need to repeat this, of course, if you've already done this in the Step 1 notebook

basic4zip = '{}/simsignals_basic_v2/basic4.zip'.format(base_url)
os.system('curl {} > {}/{}'.format(basic4zip, mydatafolder, 'basic4.zip'))

In [None]:
!ls -alrht $mydatafolder

In [None]:
outputpng_folder = mydatafolder + '/png'
if os.path.exists(outputpng_folder) is False:
    os.makedirs(outputpng_folder)
print outputpng_folder

In [None]:
#Use `ibmseti`, or other methods, to draw the spectrograms

def draw_spectrogram(data):
    
    aca = ibmseti.compamp.SimCompamp(data)
    spec = aca.get_spectrogram()

    # Instead of using SimCompAmp.get_spectrogram method
    # perform your own signal processing here before you create the spectrogram
    #
    # SimCompAmp.get_spectrogram is relatively simple. Here's the code to reproduce it:
    #
    # header, raw_data = r.content.split('\n',1)
    # complex_data = np.frombuffer(raw_data, dtype='i1').astype(np.float32).view(np.complex64)
    # shape = (int(32*8), int(6144/8))
    # spec = np.abs( np.fft.fftshift( np.fft.fft( complex_data.reshape(*shape) ), 1) )**2
    # 
    # But instead of the line above, can you manipulate `complex_data` with signal processing
    # techniques in the time-domain (windowing?, de-chirp?), or manipulate the output of the 
    # np.fft.fft process in a way to improve the signal to noise (Welch periodogram, subtract noise model)? 
    # 
    # example: Apply Hanning Window
    # complex_data = complex_data.reshape(*shape)
    # complex_data = complex_data * np.hanning(complex_data.shape[1])
    # spec = np.abs( np.fft.fftshift( np.fft.fft( complex_data ), 1) )**2
    
    # Alternatively:
    # If you're using ibmseti 1.0.5 or greater, you can define a signal processing function,
    # which will be passed the 2D complex-value time-series numpy array. Your processing function should return a 2D
    # numpy array -- though it doesn't need to be complex-valued or even the same size.
    # The SimCompamp.get_spectrogram function will treat the output of your signals processing function
    # in the same way it treats the raw 2d complex-valued time-series data. 
    # The fourier transform of each row in the 2D array will be calculated
    # and then squared to produce the spectrogram.
    #
    # def mySignalProcessing(compData):
    #   return compData * np.hanning(compData.shape[1])
    #
    # aca.sigProc(mySignalProcessing)
    # spec = aca.get_spectrogram()
    #
    # You can define more sophisticated signal processing inside your function.  
    #


    fig, ax = plt.subplots(figsize=(10, 5))   

    # do different color mappings affect Watson's classification accuracy?
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='hot')
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='gray')
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='Greys')
    
    # If you're going to plot the log, make sure there are no values less than or equal to zero 
    spec_pos_min = spec[spec > 0].min()
    spec[spec <= 0] = spec_pos_min

    ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='gray')
    
    return fig, aca.header()
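
The commented-out recipe inside `draw_spectrogram` can be tried without downloading any data. The following is a minimal sketch, assuming only NumPy: it applies the same reshape, FFT, shift and squared magnitude to a synthetic tone rather than to a real `.dat` file. The helper name `spectrogram_from_complex` and the test tone are made up for illustration and are not part of `ibmseti`.

```python
import numpy as np

def spectrogram_from_complex(complex_data, shape=(32 * 8, 6144 // 8), window=None):
    """Reshape a 1D complex time series into rows, FFT each row, return the power."""
    rows = complex_data.reshape(*shape)
    if window is not None:
        # window(n) must return n real weights, e.g. np.hanning
        rows = rows * window(rows.shape[1])
    # FFT along each row, move zero frequency to the centre column, take |.|^2
    return np.abs(np.fft.fftshift(np.fft.fft(rows), 1)) ** 2

# Synthetic stand-in for one data file: a pure tone at 96/768 cycles per sample.
n_rows, n_cols = 32 * 8, 6144 // 8
t = np.arange(n_rows * n_cols)
tone = np.exp(2j * np.pi * (96.0 / n_cols) * t).astype(np.complex64)

spec_plain = spectrogram_from_complex(tone)
spec_hann = spectrogram_from_complex(tone, window=np.hanning)

print(spec_plain.shape)
print(np.unique(spec_hann.argmax(axis=1)))
```

A constant-frequency tone should show up as a single bright vertical line, with and without the window. A drifting signal would trace a slanted line instead, which is where windowing or de-chirping may start to matter.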


In [None]:
from pyspark import SparkConf, SparkContext
conf = SparkConf().set("spark.ui.showConsoleProgress", "false")
sc = SparkContext(appName="PythonStatusAPIDemo", conf=conf)

## We're going to use Spark to distribute the job of creating the PNGs on the executor nodes
myzipfilepath = os.path.join(mydatafolder,'basic4.zip')
print myzipfilepath

zz = zipfile.ZipFile(myzipfilepath)

filenames = filter(lambda x: x.endswith('.dat'), zz.namelist())  #this filters out the top-level folder in the zip file, which is a separate entry in the namelist

#2 executors are available on free-tier IBM Spark clusters. If you have access to an Enterprise cluster,
#which has 30 executors, you should parallelize to 120 partitions.
rdd = sc.parallelize(filenames, 8)

In [None]:
def extract_data(row):
    zzz = zipfile.ZipFile(myzipfilepath)
    return (row, zzz.open(row).read())

rdd = rdd.map(extract_data)
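
`extract_data` reopens the zip file inside every call, presumably because an open `ZipFile` handle cannot be shipped to the Spark executors. The filter-then-read pattern can be checked locally first. This sketch uses a tiny in-memory archive whose member names and contents are invented for illustration:

```python
import io
import zipfile

# Build a small in-memory archive that mimics the layout: a top-level folder
# entry plus a few .dat members. Names and contents here are made up.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zz:
    zz.writestr('basic4/', b'')
    zz.writestr('basic4/aaa.dat', b'header\nraw-bytes-1')
    zz.writestr('basic4/bbb.dat', b'header\nraw-bytes-2')

with zipfile.ZipFile(buf) as zz:
    # Keep only the data files; the folder itself is a separate namelist entry.
    dat_names = [n for n in zz.namelist() if n.endswith('.dat')]
    # Each row is (member name, raw bytes), the same shape extract_data returns.
    rows = [(n, zz.read(n)) for n in dat_names]

print(dat_names)
```

`zz.read(name)` is shorthand for `zz.open(name).read()`.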

In [31]:
def convert_to_spectrogram_and_save(row):
    name = os.path.basename(row[0])
    fig, header = draw_spectrogram(row[1])
    png_file = name + '.png'
    fig.savefig(outputpng_folder + '/' + png_file)
    plt.close(fig)
    return (name, header, png_file)

In [32]:
rdd = rdd.map(convert_to_spectrogram_and_save)

In [33]:
results = rdd.collect()  #This took about 70s on an Enterprise cluster. It will take longer on your free-tier. 

In [34]:
results[0]

('000919a5-bc7f-471e-959c-81adba0b1f36.dat',
 {u'signal_classification': u'squiggle',
  u'uuid': u'000919a5-bc7f-471e-959c-81adba0b1f36'},
 '000919a5-bc7f-471e-959c-81adba0b1f36.dat.png')

# Create Training / Test sets

Using the `basic` list, we'll create training and test sets for each signal class. Then we'll archive the `.png` files into a handful of `.zip` files. (The `.zip` files must be smaller than 100 MB because Watson Visual Recognition limits the size of the data batches that can be uploaded when training a classifier.)

In [35]:
# Grab the Basic file list in order to 
# Organize the Data into classes

r = requests.get('{}/simsignals_files/public_list_basic_v2_26may_2017.csv'.format(base_url), timeout=(9.0, 21.0))

uuids_classes_as_list = r.text.split('\n')[1:-1]  #slice off the first line (header) and last line (empty)

def row_to_json(row):
    uuid,sigclass = row.split(',')
    return {'uuid':uuid, 'signal_classification':sigclass}

uuids_classes_as_list = map(lambda row: row_to_json(row), uuids_classes_as_list)
print "found {} files".format(len(uuids_classes_as_list))

uuids_group_by_class = {}
for item in uuids_classes_as_list:
    uuids_group_by_class.setdefault(item['signal_classification'], []).append(item)

found 4000 files
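
The parse-and-group step can also be written with the standard `csv` module, which tolerates `\r\n` line endings and blank lines without manual slicing. This is a sketch only: the header text and the sample rows below are invented, and the real index file may differ.

```python
import csv

def group_uuids_by_class(csv_text):
    """Parse 'uuid,class' rows (after one header line) into {class: [records]}."""
    lines = csv_text.splitlines()
    reader = csv.reader(lines[1:])  # skip the header line
    groups = {}
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        uuid, sigclass = row
        groups.setdefault(sigclass, []).append(
            {'uuid': uuid, 'signal_classification': sigclass})
    return groups

# Hypothetical sample in the same two-column layout as the index file.
sample = "UUID,SIGNAL_CLASSIFICATION\r\nu-1,noise\r\nu-2,squiggle\r\nu-3,noise\r\n"
groups = group_uuids_by_class(sample)
print(sorted((k, len(v)) for k, v in groups.items()))
```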


In [36]:
#At first, use just 20 percent and 10 percent. This will be useful 
#as you prototype. Then you can come back here and increase these
#percentages as needed.

training_percentage = 0.20
test_percentage = 0.10

assert training_percentage + test_percentage <= 1.0

training_set_group_by_class = {}
test_set_group_by_class = {}
for k, v in uuids_group_by_class.iteritems():
    
    total = len(v)
    training_size = int(total * training_percentage)
    test_size = int(total * test_percentage)
    
    training_set = v[:training_size]
    test_set = v[total - test_size:]  #explicit start index; v[-0:] would return the whole list if test_size were 0
    
    training_set_group_by_class[k] = training_set
    test_set_group_by_class[k] = test_set
    
    print '{}: training set size: {}'.format(k, len(training_set))
    print '{}: test set size: {}'.format(k, len(test_set))
    

squiggle: training set size: 200
squiggle: test set size: 100
narrowband: training set size: 200
narrowband: test set size: 100
noise: training set size: 200
noise: test set size: 100
narrowbanddrd: training set size: 200
narrowbanddrd: test set size: 100
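
The split takes the training set from the head of each class list and the test set from the tail, so the two never overlap as long as the percentages sum to at most 1. One edge case to check when adapting this to smaller classes is a test size that rounds down to 0, because a slice such as `v[-0:]` returns the whole list. A minimal sketch of the same logic as a reusable function (the name `split_train_test` is hypothetical):

```python
def split_train_test(items, training_percentage, test_percentage):
    """Return (training_set, test_set): head and tail slices that never overlap."""
    if training_percentage + test_percentage > 1.0:
        raise ValueError('percentages must sum to at most 1.0')
    total = len(items)
    training_size = int(total * training_percentage)
    test_size = int(total * test_percentage)
    training_set = items[:training_size]
    # Slice from an explicit start index: items[-test_size:] would return the
    # whole list when test_size is 0.
    test_set = items[total - test_size:]
    return training_set, test_set

big_train, big_test = split_train_test(list(range(1000)), 0.20, 0.10)
small_train, small_test = split_train_test(list(range(5)), 0.20, 0.10)
print('{} {} {} {}'.format(len(big_train), len(big_test),
                           len(small_train), len(small_test)))
```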


In [37]:
training_set_group_by_class['noise'][0]

{'signal_classification': u'noise',
 'uuid': u'498becc2-3693-45b3-8533-50e93532706a'}

In [38]:
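#NOTE: `v` is left over from the loop in the previous cell (the last class iterated),
#so this is only a quick look at the file names for that one class
fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]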

In [39]:
zipfilefolder = mydatafolder + '/zipfiles'
if os.path.exists(zipfilefolder) is False:
    os.makedirs(zipfilefolder)

In [40]:
max_zip_file_size_in_mb = 25

In [41]:
#Create the Zip files containing the training PNG files
#Note that this limits output files to be less than <max_zip_file_size_in_mb> MB because WatsonVR has a limit on the 
#size of input files that can be sent in single HTTP calls to train a custom classifier

for k, v in training_set_group_by_class.iteritems():
    
    fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]  #yes, files are <uuid>.dat.png :/
    
    count = 1
    for fn in fnames:
        
        archive_name = '{}/classification_{}_{}.zip'.format(zipfilefolder, count, k)
        
        if os.path.exists(archive_name):
            zz = zipfile.ZipFile(archive_name, mode='a')
        else:
            print 'creating new archive', archive_name
            zz = zipfile.ZipFile(archive_name, mode='w')
           
        zz.write(fn)
        zz.close()
        
        #if the archive exceeds <max_zip_file_size_in_mb> MB, increase count to start a new one
        #(the size is checked after writing, so an archive can end up slightly over the limit)
        if os.path.getsize(archive_name) > max_zip_file_size_in_mb * 1024 ** 2:
            count += 1
            

creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_1_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_2_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_3_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_1_narrowband.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_2_narrowband.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_3_narrowband.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_4_narrowband.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_1_noise.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/classification_2_noise.zip
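
The batching loop can be wrapped in a function and exercised on small dummy files before running it on the real PNGs. The sketch below keeps the same write-then-check-size logic, with two assumptions of its own: it stores each file under its base name (`arcname`) rather than its full path, and it deletes any archive left over from an earlier run so that re-running does not append duplicates. The function name and the dummy files are hypothetical.

```python
import os
import tempfile
import zipfile

def write_size_limited_zips(file_paths, archive_template, max_bytes):
    """Add files to numbered archives, starting a new one once max_bytes is exceeded."""
    archives = []
    count = 1
    for fn in file_paths:
        archive_name = archive_template.format(count)
        if archive_name not in archives:
            # Start each archive fresh so a re-run does not append duplicates.
            if os.path.exists(archive_name):
                os.remove(archive_name)
            archives.append(archive_name)
        # Mode 'a' creates the archive if it does not exist yet.
        with zipfile.ZipFile(archive_name, mode='a') as zz:
            zz.write(fn, arcname=os.path.basename(fn))
        # Checked after writing, so an archive may end slightly over the limit.
        if os.path.getsize(archive_name) > max_bytes:
            count += 1
    return archives

# Demo with five 1000-byte dummy files and a 2500-byte limit.
demo_dir = tempfile.mkdtemp()
demo_files = []
for i in range(5):
    path = os.path.join(demo_dir, 'img_{}.png'.format(i))
    with open(path, 'wb') as f:
        f.write(b'\0' * 1000)
    demo_files.append(path)

template = os.path.join(demo_dir, 'classification_{}_demo.zip')
archives = write_size_limited_zips(demo_files, template, max_bytes=2500)
entries_per_archive = [len(zipfile.ZipFile(a).namelist()) for a in archives]
print(entries_per_archive)
```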

In [42]:
# Create the Zip files containing the test PNG files using the following naming convention:
# testset_<NUMBER>_<CLASS>.zip (The next notebook example using Watson will break if a 
# different naming convention is used) Refer to 
# https://www.ibm.com/watson/developercloud/visual-recognition/api/v3/#classify_an_image 
# for ZIP size and content limitations:
# "The max number of images in a .zip file is limited to 20, and limited to 5 MB."

for k, v in test_set_group_by_class.iteritems():
    
    fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]  #yes, files are <uuid>.dat.png :/
    
    count = 1
    img_count = 0
    for fn in fnames:
        
        archive_name = '{}/testset_{}_{}.zip'.format(zipfilefolder, count, k)
        
        if os.path.exists(archive_name):
            zz = zipfile.ZipFile(archive_name, mode='a')
        else:
            print 'creating new archive', archive_name
            zz = zipfile.ZipFile(archive_name, mode='w')
           
        zz.write(fn)
        zz.close()
        img_count += 1
        #if the archive reaches 4.7 MB (leaving headroom under the 5 MB limit) or holds 20 images,
        # increase count to start a new one
        if (os.path.getsize(archive_name) >= 4.7 * 1024 ** 2) or img_count == 20:
            count += 1
            img_count = 0
            

creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_1_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_2_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_3_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_4_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_5_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_6_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_7_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_8_squiggle.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/current_data/zipfiles/testset_1_narrowband.zip
creating new archive /home/fbarilla/SETI/ML4SETI/tutorials/cur
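
Because the size check runs after each image is written, it may be worth confirming that every test archive stays within the quoted limits (20 images, 5 MB) before sending it to Watson. A minimal sketch of such a check follows. The helper name and the demo archive are hypothetical, and the default limits are simply the ones quoted in the cell above.

```python
import os
import tempfile
import zipfile

def archive_within_limits(archive_path, max_images=20, max_bytes=5 * 1024 ** 2):
    """True if the archive holds at most max_images files and max_bytes on disk."""
    with zipfile.ZipFile(archive_path) as zz:
        # Ignore directory entries; count only real members.
        n_images = len([n for n in zz.namelist() if not n.endswith('/')])
    return n_images <= max_images and os.path.getsize(archive_path) <= max_bytes

# Demo archive with three small dummy members.
demo_zip = os.path.join(tempfile.mkdtemp(), 'testset_1_demo.zip')
with zipfile.ZipFile(demo_zip, mode='w') as zz:
    for i in range(3):
        zz.writestr('img_{}.png'.format(i), b'\0' * 100)

ok_default = archive_within_limits(demo_zip)
ok_strict = archive_within_limits(demo_zip, max_images=2)
print('{} {}'.format(ok_default, ok_strict))
```

On the real output, the same function could be applied to every `testset_*.zip` file found in `zipfilefolder`.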

In [43]:
!ls -alrth $mydatafolder/zipfiles

total 459M
drwxrwxr-x 4 fbarilla fbarilla 4.0K Jul 31 17:01 ..
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_1_squiggle.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_2_squiggle.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_3_squiggle.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_1_narrowband.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_2_narrowband.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_3_narrowband.zip
-rw-rw-r-- 1 fbarilla fbarilla 788K Jul 31 17:01 classification_4_narrowband.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_1_noise.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_2_noise.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classification_3_noise.zip
-rw-rw-r-- 1 fbarilla fbarilla 2.8M Jul 31 17:01 classification_4_noise.zip
-rw-rw-r-- 1 fbarilla fbarilla  26M Jul 31 17:01 classifica

# Exporting your Zip files

If you've been running this on an IBM DSX Spark cluster and you wish to move your data out of the local file space, the easiest and fastest way is to push the zip files to an IBM Object Storage account. An Object Storage instance was created for you when you signed up for DSX.

You do NOT need to do this if you're going on to the next notebook where you use Watson to classify your images from this Spark cluster. That notebook will read the data from the local file space.


### Get your Object Storage Credentials
1. Log in to https://bluemix.net
2. Scroll down and find your Object Storage instance. 
  * If you do not have one, find the "Catalog" link and look for the Object Storage service to create a new instance (5 GB of free space)
3. Select the `Service Credentials` tab and `View Credentials`
4. Copy these into your notebook below.

### Create a Container
5. CREATE A CONTAINER in your Object Storage that you will use below.


In [None]:
import swiftclient.client as swiftclient

credentials = {
  'auth_uri':'',
  'global_account_auth_uri':'',
  'username':'xx',
  'password':"xx",
  'auth_url':'https://identity.open.softlayer.com',
  'project':'xx',
  'projectId':'xx',
  'region':'dallas',
  'userId':'xx',
  'domain_id':'xx',
  'domain_name':'xx',
  'tenantId':'xx'
}

In [None]:
#Use the `credentials` dictionary defined in the previous cell
conn_seti_data = swiftclient.Connection(
    key=credentials['password'],
    authurl=credentials['auth_url']+"/v3",
    auth_version='3',
    os_options={
        "project_id": credentials['projectId'],
        "user_id": credentials['userId'],
        "region_name": credentials['region']})

In [None]:
myObjectStorageContainer = 'seti_pngs'  
someFile = os.path.join(zipfilefolder, 'classification_1_narrowband.zip')

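#Read the archive inside a `with` block so the file handle is closed after the upload
with open(someFile, 'rb') as f:
    etag = conn_seti_data.put_object(myObjectStorageContainer, someFile, f.read())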
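
`put_object` returns the ETag reported by the server. For a regular (non-segmented) Swift object this is normally the MD5 hex digest of the uploaded bytes, so comparing it with a locally computed digest is a cheap integrity check. The sketch below shows only the local half. The helper name and the demo file are hypothetical, and the comparison assumes your Object Storage service does not alter the ETag.

```python
import hashlib
import os
import tempfile

def md5_hexdigest(path, chunk_size=1024 * 1024):
    """MD5 hex digest of a file, read in chunks so large zips need little memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small temporary file standing in for one of the zip archives.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.zip')
with open(demo_path, 'wb') as f:
    f.write(b'hello')

local_md5 = md5_hexdigest(demo_path)
print(local_md5)
```

After an upload, `etag == md5_hexdigest(someFile)` would be expected to hold under the assumption above.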
