# LIGO - Gravitational Waves simple classification - Part 1

This is our final project for the "Seminario intensivo de topicos avanzados en datos complejos" (Intensive Seminar on Advanced Topics in Complex Data). We intend to download a pre-classified set of gravitational-wave data provided by the LIGO observatory (both the Livingston and Hanford sites) during its first and second observing runs (O1 and O2).
This dataset is part of the Gravity Spy paper and libraries; it is the set on which that project's convolutional networks were trained. (For more information on the subject, please refer to ["Gravity Spy - Integrated Advanced LIGO detector characterization machine learning and citizen science"](https://arxiv.org/abs/1611.04596).) Since the dataset is not small, we point you to the original location from which it can be [downloaded](https://zenodo.org/record/1486046#.XrLYbi-ZMSQ).

> Please keep in mind that this work is the first step toward my specialization's final project, which will involve classifying gravitational waves using convolutional neural networks.

> Note that this notebook, its datasets, and its scripts do not run on the course's preset Docker images for Spark. We changed /spark/Dockerfile to build an image with the required versions. The Docker-Compose.yaml file has also been updated to work with this image.

The data set provides three files:

- trainingset_v1d1_metadata.csv

This file has many columns, including gravityspy_id, label, and sample_type. 
*gravityspy_id* is the unique 10-character hash given to every Gravity Spy sample. 
*label* is the string label of the sample. 
*sample_type* indicates whether this sample was used in the paper for testing, training, or validating the models. This is provided for those who would like to make direct comparisons with the network described in the paper.
Additional columns contain metadata about the "glitches".

- trainingsetv1d1.h5

This file contains the exact arrays used in the paper for every Gravity Spy sample. Each Gravity Spy sample is defined by four different images with varying temporal durations: 0.5, 1.0, 2.0, and 4.0 seconds, respectively. 
This file contains all the information needed for each sample in the Gravity Spy dataset.

- trainingsetv1d1.tar.gz

Contains the raw PNGs of the Gravity Spy training set.
The structure of the folder is /"label"/"sample_type"/"pngs", and is the same information you could find in the HDF5 file above.
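The label / sample_type / sample hierarchy described above can be walked with a few nested loops. The following is only a minimal sketch: the helper name `iter_samples` is hypothetical, the stand-in dictionary uses made-up pixel values and a made-up `Chirp` sample ID, and the `1.0.png` key name is an assumption (only `0.5.png` appears in this notebook). Because an open `h5py.File` exposes `.items()` like a dictionary, the same function should also accept the real file.

```python
from typing import Any, Iterator, Mapping, Tuple


def iter_samples(root: Mapping[str, Any]) -> Iterator[Tuple[str, str, str, Any]]:
    """Yield (label, sample_type, gravityspy_id, images) for every sample.

    `root` is anything that behaves like nested mappings with `.items()`,
    for example an open h5py.File or plain dictionaries.
    """
    for label, by_sample_type in root.items():
        for sample_type, by_id in by_sample_type.items():
            for gravityspy_id, images in by_id.items():
                yield label, sample_type, gravityspy_id, images


# Tiny in-memory stand-in for the HDF5 file (hypothetical values and IDs).
fake_file = {
    "Whistle": {
        "train": {"zmIdpucyOG": {"0.5.png": [[0.1, 0.2]], "1.0.png": [[0.3, 0.4]]}},
    },
    "Chirp": {
        "test": {"exampleid1": {"0.5.png": [[0.5, 0.6]]}},
    },
}

samples = list(iter_samples(fake_file))
for label, sample_type, gravityspy_id, images in samples:
    print(label, sample_type, gravityspy_id, sorted(images.keys()))
```

With the real data, the call would look like `iter_samples(h5py.File('/dataset/trainingsetv1d1.h5', 'r'))`.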

## What we are going to do

We are going to divide this work in two parts:

1. Merge the images with the data in the CSV file, so that we have a simple Spark DataFrame to operate on. We call this task "regularization", or "normalization", of the data. This notebook works through this step and shows our exploration process. However, the script that does the heavy lifting is *generate_gw_dataset.py*, which performs the same steps as this notebook but on the whole dataset rather than only the first 30 rows.

2. Load this DataFrame and apply at least one classifier to the data, with the intention of distinguishing "Chirps" from "no-chirps". Keep in mind that, according to the referenced paper, the "Chirps" are "Gravitational Waves"; all the other classes are just noise resembling them. This will be handled in a different notebook and Python script.
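The "Chirp" versus "no-chirp" target of the second part amounts to collapsing the Gravity Spy labels into two classes. A minimal sketch of that mapping, assuming the label string is exactly `Chirp` as it appears in the dataset keys (the function name `to_binary_label` is hypothetical and is not taken from the project's scripts):

```python
def to_binary_label(label: str) -> int:
    """Return 1 for 'Chirp' samples and 0 for every other class."""
    return 1 if label == "Chirp" else 0


# A few label names from the dataset, used here only as a demonstration.
labels = ["Whistle", "Chirp", "Blip", "No_Glitch"]
binary = [to_binary_label(name) for name in labels]
print(binary)  # [0, 1, 0, 0]
```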





In [None]:
import numpy as np
import h5py
import os
from gwpy.timeseries import TimeSeries

In [2]:
#Getting the correct path
os.getcwd()


'/notebook'

In [52]:
hd5f_file='/dataset/trainingsetv1d1.h5'

In [53]:
hf = h5py.File(hd5f_file, 'r')


In [54]:
list(hf.keys())


['1080Lines',
 '1400Ripples',
 'Air_Compressor',
 'Blip',
 'Chirp',
 'Extremely_Loud',
 'Helix',
 'Koi_Fish',
 'Light_Modulation',
 'Low_Frequency_Burst',
 'Low_Frequency_Lines',
 'No_Glitch',
 'None_of_the_Above',
 'Paired_Doves',
 'Power_Line',
 'Repeating_Blips',
 'Scattered_Light',
 'Scratchy',
 'Tomte',
 'Violin_Mode',
 'Wandering_Line',
 'Whistle']

In [6]:
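import matplotlib.pyplot as plt
#This snippet was used to understand how the image loaded from the dataset was stored; how "clear" it was.
#We load the 0.5-second image of one known sample (label / sample_type / gravityspy_id).
png = hf['Whistle']['train']['zmIdpucyOG']['0.5.png']
png = png[0]
plt.imshow(png)

plt.axis('off')
plt.show() #call the function; a bare plt.show only echoes the function object.

#Image.fromarray(png[0],'RGB').show()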

In [1]:
# Spark version
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark-gw").getOrCreate()

print(spark.version)


2.4.5


In [55]:
csv = '/dataset/trainingset_v1d1_metadata.csv'
df = spark.read.csv(csv, header=True)


In [56]:
import pandas as pd

#first, we get the dataframe in pandas with all the headers for the data.
csv = '/dataset/trainingset_v1d1_metadata.csv'
df=pd.read_csv(csv, sep=',',header=0)

df.head()

Unnamed: 0,event_time,ifo,peak_time,peak_time_ns,start_time,start_time_ns,duration,search,process_id,event_id,...,chisq_dof,param_one_name,param_one_value,gravityspy_id,label,sample_type,url1,url2,url3,url4
0,1134216000.0,L1,1134216192,931639909,1134216192,832031011,0.1875,Omicron,0,21,...,0,phase,-2.72902,zmIdpucyOG,Whistle,train,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...
1,1129360000.0,L1,1129359781,558593034,1129359781,47851085,0.94238,Omicron,0,107,...,0,phase,1.10682,zWFRqqDxwv,Whistle,test,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...
2,1127425000.0,L1,1127425468,976317882,1127425468,960937023,0.04688,Omicron,0,218,...,0,phase,-0.83099,zKCTakFVcf,Whistle,train,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...
3,1132637000.0,L1,1132636755,365233898,1132636754,951172113,0.82422,Omicron,0,88,...,0,phase,0.76242,z14BdoiFZS,Whistle,validation,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...
4,1132036000.0,L1,1132035853,197264909,1132035852,933837890,2.00366,Omicron,0,16,...,0,phase,-0.31161,yyjqLCtAmO,Whistle,validation,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...
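The `label` and `sample_type` columns shown above determine how many samples each class contributes to the train, validation, and test splits. A minimal sketch of counting them follows; the helper name `split_sizes` is hypothetical, and the demo pairs are the five rows printed above rather than the full file.

```python
from collections import Counter, defaultdict


def split_sizes(pairs):
    """Count samples per label and per sample_type.

    `pairs` is any iterable of (label, sample_type) tuples.
    """
    table = defaultdict(Counter)
    for label, sample_type in pairs:
        table[label][sample_type] += 1
    # Convert to plain dictionaries for easier printing and comparison.
    return {label: dict(counts) for label, counts in table.items()}


# (label, sample_type) of the five rows displayed by df.head().
demo_pairs = [
    ("Whistle", "train"),
    ("Whistle", "test"),
    ("Whistle", "train"),
    ("Whistle", "validation"),
    ("Whistle", "validation"),
]
sizes = split_sizes(demo_pairs)
print(sizes)
```

On the real metadata, the pairs could be supplied as `zip(df['label'], df['sample_type'])`.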


In [18]:
#now, we test the idea of dissecting the CSV
gravityspy_id= df['gravityspy_id'][0]
label= df['label'][0]
sample_type= df['sample_type'][0]

png = hf[label][sample_type][gravityspy_id]['0.5.png']
png = png[0]
png

array([[0.10231367, 0.13103157, 0.14924803, ..., 0.09779692,
        0.09669176, 0.08488595],
       [0.0949822 , 0.11268353, 0.15216126, ..., 0.1067898 ,
        0.09669176, 0.09332079],
       [0.11522029, 0.1067898 , 0.13609917, ..., 0.10341883,
        0.11163108, 0.11008061],
       ...,
       [0.08779499, 0.08779499, 0.08779499, ..., 0.11614116,
        0.11669263, 0.11881158],
       [0.09332079, 0.09332079, 0.09332079, ..., 0.13204627,
        0.1351345 , 0.1351345 ],
       [0.1067898 , 0.1067898 , 0.1067898 , ..., 0.1067898 ,
        0.1067898 , 0.1067898 ]], dtype=float32)
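A 2-D image like the array above has to be flattened before it can sit in a single DataFrame column, and it can be restored later as long as its shape is kept. The following is a minimal sketch under that assumption; the helper names `flatten_image` and `restore_image` are hypothetical, and the 2 x 3 demo array is a stand-in for a real Gravity Spy image.

```python
import numpy as np


def flatten_image(image):
    """Return the image as a flat Python list together with its original shape."""
    image = np.asarray(image)
    return image.reshape(-1).tolist(), image.shape


def restore_image(flat, shape):
    """Rebuild a float32 image array from a flat list and its shape."""
    return np.asarray(flat, dtype=np.float32).reshape(shape)


# Small stand-in image; a real sample would be much larger.
demo = np.arange(6, dtype=np.float32).reshape(2, 3)
flat, shape = flatten_image(demo)
restored = restore_image(flat, shape)
print(len(flat), shape, restored.dtype)
```

Keeping the shape next to the flat list avoids hard-coding the number of pixels when the image is rebuilt for a classifier.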

In [62]:
#forming a new dataset

gwdf = df[0:30].copy() #we reduce the set of data to test the algorithm at a fast rate; .copy() avoids pandas' SettingWithCopyWarning.

print(gwdf.shape)
print(type(gwdf))
print(gwdf.columns)

gwdf['png'] = '' #placeholder column that will hold each flattened image.

print(gwdf.shape)
print(gwdf['png'])


for index, record in gwdf.iterrows():
    #we look the columns up by name rather than by position.
    gravityspy_id = record['gravityspy_id']
    label = record['label']
    sample_type = record['sample_type']

    png = np.array(hf[label][sample_type][gravityspy_id]['0.5.png'][0])

    png = np.reshape(png, 23800).tolist() #we place the whole image in a one-dimensional array.

    gwdf.at[index, 'png'] = png

#gwdf.to_csv('/dataset/gw_gravity_spy_dataframe.csv') #<- We tested this approach as well, but the png array was summarized.
#gwdf.to_pickle('/dataset/test.pickle') #<- We tested this approach, but way too many records were saved as null values.


(30, 28)
<class 'pandas.core.frame.DataFrame'>
Index(['event_time', 'ifo', 'peak_time', 'peak_time_ns', 'start_time',
       'start_time_ns', 'duration', 'search', 'process_id', 'event_id',
       'peak_frequency', 'central_freq', 'bandwidth', 'channel', 'amplitude',
       'snr', 'confidence', 'chisq', 'chisq_dof', 'param_one_name',
       'param_one_value', 'gravityspy_id', 'label', 'sample_type', 'url1',
       'url2', 'url3', 'url4'],
      dtype='object')
(30, 29)
0     
1     
2     
3     
4     
5     
6     
7     
8     
9     
10    
11    
12    
13    
14    
15    
16    
17    
18    
19    
20    
21    
22    
23    
24    
25    
26    
27    
28    
29    
Name: png, dtype: object


In [63]:
import findspark
from pyspark.sql import SparkSession

findspark.init()
spark = SparkSession.builder.appName("pyspark-gw").getOrCreate()

sdf = spark.createDataFrame(gwdf) #getting a Spark DataFrame out of a pandas DataFrame; the SparkSession does this directly, so no SQLContext is needed.

In [68]:
sdf.head() #we check that the Spark DataFrame holds the expected data.

Row(event_time=1134216192.9316401, ifo='L1', peak_time=1134216192, peak_time_ns=931639909, start_time=1134216192, start_time_ns=832031011, duration=0.1875, search='Omicron', process_id=0, event_id=21, peak_frequency=1337.6953125, central_freq=1120.04321289062, bandwidth=573.363952636719, channel='GDS-CALIB_STRAIN', amplitude=1.1976500147017298e-22, snr=7.511390209198001, confidence=0, chisq=0, chisq_dof=0, param_one_name='phase', param_one_value=-2.7290200000000002, gravityspy_id='zmIdpucyOG', label='Whistle', sample_type='train', url1='https://panoptes-uploads.zooniverse.org/production/subject_location/e0a29b6e-30da-4b83-a823-d84262e5f32b.png', url2='https://panoptes-uploads.zooniverse.org/production/subject_location/ccf7f147-a727-46e6-95ec-65f3c7b619f8.png', url3='https://panoptes-uploads.zooniverse.org/production/subject_location/1a3a0247-6b62-4f64-884f-387348c0557f.png', url4='https://panoptes-uploads.zooniverse.org/production/subject_location/0ffb52cb-ad48-4ae9-b120-a8aa5aa3d77b.p

In [13]:
unpickled_df = pd.read_pickle("/dataset/test.pickle")
unpickled_df

Unnamed: 0,event_time,ifo,peak_time,peak_time_ns,start_time,start_time_ns,duration,search,process_id,event_id,...,param_one_name,param_one_value,gravityspy_id,label,sample_type,url1,url2,url3,url4,png
0,1134216000.0,L1,1134216192,931639909,1134216192,832031011,0.1875,Omicron,0,21,...,phase,-2.72902,zmIdpucyOG,Whistle,train,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.10231367, 0.13103157, 0.14924803, 0.1039726..."
1,1129360000.0,L1,1129359781,558593034,1129359781,47851085,0.94238,Omicron,0,107,...,phase,1.10682,zWFRqqDxwv,Whistle,test,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.086593725, 0.086593725, 0.11495804, 0.12481..."
2,1127425000.0,L1,1127425468,976317882,1127425468,960937023,0.04688,Omicron,0,218,...,phase,-0.83099,zKCTakFVcf,Whistle,train,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.1650454, 0.096691765, 0.13029973, 0.0832227..."
3,1132637000.0,L1,1132636755,365233898,1132636754,951172113,0.82422,Omicron,0,88,...,phase,0.76242,z14BdoiFZS,Whistle,validation,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.24178803, 0.15504831, 0.16901046, 0.1666115..."
4,1132036000.0,L1,1132035853,197264909,1132035852,933837890,2.00366,Omicron,0,16,...,phase,-0.31161,yyjqLCtAmO,Whistle,validation,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.13052422, 0.097796924, 0.13480349, 0.207155..."
5,1163422000.0,H1,1163421591,621093034,1163421591,492187023,0.38281,OMICRON,0,228,...,phase,1.56686,tsUHxRhgQU,1080Lines,train,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.13052422, 0.121122606, 0.0988869, 0.0916618..."
6,1135087000.0,L1,1135086850,427246093,1135086850,310547113,0.70312,Omicron,0,78,...,phase,0.50844,yZPB2Lkecd,Whistle,validation,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.30458266, 0.2807377, 0.124193706, 0.0922156..."
7,1136285000.0,L1,1136285262,929687023,1136285262,976085,1.62012,Omicron,0,0,...,phase,0.01421,yKsXudIzbX,Whistle,train,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.12370349, 0.09166182, 0.11043775, 0.1335218..."
8,1132651000.0,L1,1132651216,955077886,1132651216,0,1.24512,Omicron,0,92,...,phase,1.13611,xo7m4GKIOx,Whistle,train,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.1071072, 0.155646, 0.11645582, 0.08377655, ..."
9,1132637000.0,L1,1132637476,677733898,1132637476,342772960,0.74317,Omicron,0,84,...,phase,-1.97719,xmxiRoeHxh,Whistle,test,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,https://panoptes-uploads.zooniverse.org/produc...,"[0.08377655, 0.09553355, 0.10988048, 0.0860399..."


In [72]:
#We use the Spark DataFrame to write its contents (the whole Gravity Spy dataset) to a Parquet file for easy classification.
sdf.write.parquet('/dataset/gw_gravity_spy_dataframe.parquet')

In [74]:
parquet_df = spark.read.parquet('/dataset/gw_gravity_spy_dataframe.parquet')
parquet_df.head()

Row(event_time=1163850760.01758, ifo='L1', peak_time=1163850760, peak_time_ns=17577886, start_time=1163850759, start_time_ns=968750000, duration=0.125, search='OMICRON', process_id=0, event_id=163, peak_frequency=1540.77734375, central_freq=1837.2570800781198, bandwidth=2711.09765625, channel='GDS-CALIB_STRAIN', amplitude=5.167969911283529e-22, snr=13.6570596694946, confidence=0, chisq=0, chisq_dof=0, param_one_name='phase', param_one_value=-2.30207, gravityspy_id='msSTHcULv9', label='Violin_Mode', sample_type='train', url1='https://panoptes-uploads.zooniverse.org/production/subject_location/b8c5da87-67b5-4ed9-b6fd-65bb2b916179.png', url2='https://panoptes-uploads.zooniverse.org/production/subject_location/cca146ea-47e4-4ee4-b6c3-6b8a3d2d893b.png', url3='https://panoptes-uploads.zooniverse.org/production/subject_location/663f27b5-1757-4261-922b-9c43b6b0be30.png', url4='https://panoptes-uploads.zooniverse.org/production/subject_location/8953dc2d-ae15-4dae-9d74-e5dd110ecdfb.png', png=[0.