# Time series analysis in unevenly-spaced datasets

In this assignment, we're going to be working with the same time series dataset of a Galactic [low-mass X-ray binary](https://en.wikipedia.org/wiki/X-ray_binary#Low-mass_X-ray_binary), GX 5-1 ([Jonker et al. 2002](http://adsabs.harvard.edu/abs/2002MNRAS.333..665J)), as we did in the pre-class assignment.  The difference is that in class today we'll be working with the [Lomb-Scargle Periodogram](https://en.wikipedia.org/wiki/Least-squares_spectral_analysis#The_Lomb.E2.80.93Scargle_periodogram), which can be used to extract information from irregularly-spaced time series.

In [None]:
import numpy as np
import numpy.random as npr
%matplotlib inline
import matplotlib.pyplot as plt

# get the data!
counts = np.loadtxt("GX.dat")
print("original array shape:", counts.shape)

# the array is the wrong shape because the data is structured oddly.  The counts
# are in order in time, and meant to be read left to right, top to bottom.  (So
# each row is contiguous in time, and the following row comes after it in time.)
# we can sort this out by reshaping the array to be 1D
counts_regular = np.reshape(counts,counts.size,order='C')
print("NEW array shape:     ", counts_regular.shape)

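# make an array of sample times with a bin width of 1/128 s.  Building it from
# integer indices guarantees it has the same length as counts_regular.
times_regular = np.arange(counts_regular.size)/128.0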
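
To see why a C-order reshape restores the time ordering, here is a minimal sketch on a made-up 3x4 array. The array `toy` and the 4 Hz sampling rate are illustrative assumptions, not values taken from `GX.dat`.

```python
import numpy as np

# hypothetical toy data: 3 rows of 4 samples, numbered in time order
toy = np.arange(12).reshape(3, 4)

# C order reads left to right, top to bottom, so time order is preserved
flat = np.reshape(toy, toy.size, order='C')

# build the time axis from integer sample indices and an assumed sampling rate
sample_rate = 4.0  # Hz, illustrative only
toy_times = np.arange(flat.size) / sample_rate

print(flat)
print(toy_times[:5])
```

Building the time axis from integer indices keeps it the same length as the data; `np.arange` with a floating-point step can occasionally come out one element longer or shorter because of rounding.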

Before we do anything with Lomb-Scargle, let's first take a quick look at an FFT of the regularly-spaced data to see what to expect:

In [None]:
import scipy.fftpack

# do an FFT assuming everything is real
yfr = scipy.fftpack.rfft(counts_regular)

# frequency bins (remember, rfft returns an array of the same length you put in,
# but alternates real/imaginary components so we only want a frequency array that's half
# of the size of the counts array)
dt = 1./128.
xf = np.linspace(0.0, 1.0/(2.0*dt), counts_regular.size//2)

# only plotting the real part of each coefficient returned by rfft (the odd-indexed entries)
plt.plot(xf,2.0/counts_regular.size * np.abs(yfr[1::2]),'b-')
plt.xlabel('frequency [Hz]')
plt.ylabel('power [arbitrary]')
plt.title('FFT of count rates')
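
As a point of comparison, here is a minimal sketch of the same kind of spectrum using `np.fft.rfft` and `np.fft.rfftfreq`, which return complex coefficients and a matching frequency array directly. It swaps in NumPy's FFT routines for `scipy.fftpack.rfft`, and the synthetic 10 Hz signal, the 128 Hz sampling rate, and the array length are all assumptions chosen for illustration.

```python
import numpy as np

# synthetic, evenly sampled signal: a 10 Hz sinusoid sampled at 128 Hz for 8 s
# (all of these values are illustrative assumptions)
dt = 1.0 / 128.0
n = 1024
t = np.arange(n) * dt
signal = np.sin(2.0 * np.pi * 10.0 * t)

# complex FFT of real input; returns n//2 + 1 complex coefficients
coeffs = np.fft.rfft(signal)

# matching frequency for each coefficient, from 0 up to the Nyquist frequency
freqs = np.fft.rfftfreq(n, d=dt)

# amplitude spectrum; the modulus combines real and imaginary parts
amplitude = 2.0 / n * np.abs(coeffs)

peak_freq = freqs[np.argmax(amplitude)]
print("peak frequency [Hz]:", peak_freq)
```

Because the frequency array comes from the same length and bin width as the data, each coefficient is paired with its own frequency without any manual slicing.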


Now, let's make the dataset irregularly spaced in time!  We're going to randomly subsample the data, and then we're also going to mask out windows of data (because, e.g., we can't observe many things during the daytime with an optical telescope).  The total amount of data that we're going to keep is 0.5 * frac, since half of the data will be thrown away by the windowing.

In [None]:
frac = 0.5  # fraction of data to keep (before windowing)
window = 256  # size of windows to mask out 

npr.seed(8675309)

# create a boolean array where only frac of the data is 'True'.
bool_array = npr.rand(counts_regular.size) < frac

# we are going to make half of the boolean array False by reshaping it into a 2D array with
# rows of length `window`, setting every other row to False, and then reshaping it back to a
# 1D array.  This is sort of complicated, but it is the most compact way of doing this
# (otherwise we'd have to do a loop).  Note that the array size must be a multiple of window.
bool_array = bool_array.reshape((bool_array.size//window,window))
bool_array[::2,:]=False
bool_array = bool_array.reshape((bool_array.size))

# now we generate a set of irregularly-spaced counts.
counts_irr = counts_regular[bool_array]
times_irr  = times_regular[bool_array]

# plot it out!
# subsampled data is in blue, the original dataset is in red.
plt.plot(times_regular,counts_regular,'r.',times_irr,counts_irr,'bo')
plt.xlim(0,20)
plt.xlabel('time [s]')
plt.ylabel('counts/bin')
plt.title('Subsampled data')
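
As a quick check on the claim that roughly 0.5 * frac of the data survives, here is a minimal sketch that applies the same two-step masking to a synthetic boolean array. The array length and the seed are illustrative assumptions; the only requirement is that the length is a multiple of `window`.

```python
import numpy as np
import numpy.random as npr

# illustrative values; n_samples must be a multiple of window for the reshape
n_samples = 256 * 40
frac = 0.5
window = 256

npr.seed(12345)

# keep each sample independently with probability frac
keep = npr.rand(n_samples) < frac

# view the mask as rows of length window and blank every other row
keep = keep.reshape((n_samples // window, window))
keep[::2, :] = False
keep = keep.reshape(n_samples)

# the mean of a boolean array is the fraction of True entries
kept_fraction = keep.mean()
print("kept fraction:", kept_fraction, "expected about", 0.5 * frac)
```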

## Lomb-Scargle

Now, implement your own version of the Lomb-Scargle Periodogram (Feigelson equations 11.36 and 11.37), and plot the results for a range of frequencies corresponding to the range of frequencies shown above (say for frequencies between 1 and 60 Hz in steps of $\Delta f = 1$ Hz).
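
For reference, the classical form of the periodogram can be written as below. The symbols are notation chosen for this sketch ($y_i$ for the mean-subtracted counts, $t_i$ for the sample times, $\sigma^2$ for the variance of the counts, $\omega = 2\pi f$) and may differ from Feigelson's, so check it against equations 11.36 and 11.37 before relying on it.

$$
P(\omega) = \frac{1}{2\sigma^2}\left[
\frac{\left(\sum_i y_i \cos\omega(t_i-\tau)\right)^2}{\sum_i \cos^2\omega(t_i-\tau)}
+ \frac{\left(\sum_i y_i \sin\omega(t_i-\tau)\right)^2}{\sum_i \sin^2\omega(t_i-\tau)}
\right]
$$

where the time offset $\tau$ is defined at each frequency by

$$
\tan(2\omega\tau) = \frac{\sum_i \sin 2\omega t_i}{\sum_i \cos 2\omega t_i}
$$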

Question: Does your result change as you change the subsampling (the variable ```frac```) and the size of the window where data is removed (the variable ```window```)?  If so, how does the result change?

In [None]:
#counts_irr = counts_regular
#times_irr = times_regular

'''
What I'm trying to do here is get the frequency sampling to be as close to the astropy sampling as possible!
'''

# total duration of the data sampling
baseline = times_irr[-1]-times_irr[0]

# number of samples
samples = times_irr.size

# this is 5 in astropy
samples_per_peak = 2

# frequency spacing
df = 1.0/baseline/samples_per_peak

min_freq = 0.5*df

# 0.5*samples/baseline = average nyquist frequency = 1 / (2 * avg. delta_t)
# but, we know that data is unevenly sampled, so we may be able to get better frequency
# data than that.
nyquist_factor = 2
max_freq = nyquist_factor*(0.5*samples/baseline)

# calculate number of frequency bins
Nf = 1 + int(np.round((max_freq-min_freq)/df))

print("number of frequencies from astropy method:", Nf)

frequencies_mine = min_freq + df*np.arange(Nf)

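# ERASE THIS: temporary override that replaces the astropy-style grid above
# with a fixed grid from 1 to 60 Hz in steps of 0.1 Hz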
frequencies_mine = np.arange(1.0,61.0,0.1)

power_mine = np.zeros_like(frequencies_mine)

print("how many frequency bins am I actually using?", frequencies_mine.size)

In [None]:
'''
DOING THIS MAKES ALL THE DIFFERENCE

This normalizes the y-value so that it has a mean of zero.  Apparently 
having a dc offset in the values of y (X_i in Feigelson) can cause some
fairly unpredictable behavior, and thus it is important to subtract off 
the mean.  It is CRITICAL to do this.  Feigelson does not mention it; I had
to figure it out by reading the astropy source code.  (Dammit.)

'''
counts_irr -= counts_irr.mean()
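
To illustrate why the mean subtraction matters, here is a minimal sketch that runs a vectorized classical periodogram on synthetic data with and without a constant offset. The helper `classical_lomb_scargle` is a hypothetical function written for this example, and the 5 Hz signal, the offset of 100, the noise level, and the random sample times are all assumptions.

```python
import numpy as np

def classical_lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle power of y(t) at the given frequencies (in Hz).

    Hypothetical helper for this illustration; y is used as given, so
    subtract the mean beforehand if that is what you want.
    """
    y = np.asarray(y)
    omega = 2.0 * np.pi * np.asarray(freqs)[:, None]   # shape (n_freq, 1)
    t = np.asarray(t)[None, :]                         # shape (1, n_times)

    # time offset tau for each frequency, shape (n_freq, 1)
    tau = np.arctan2(np.sin(2.0 * omega * t).sum(axis=1),
                     np.cos(2.0 * omega * t).sum(axis=1))[:, None] / (2.0 * omega)

    cos_term = np.cos(omega * (t - tau))
    sin_term = np.sin(omega * (t - tau))

    return (0.5 / np.var(y)) * ((cos_term @ y) ** 2 / (cos_term ** 2).sum(axis=1)
                                + (sin_term @ y) ** 2 / (sin_term ** 2).sum(axis=1))

# synthetic, irregularly sampled data: 5 Hz sinusoid plus noise on a large offset
rng = np.random.RandomState(0)
t = np.sort(rng.uniform(0.0, 10.0, 500))
y = 100.0 + np.sin(2.0 * np.pi * 5.0 * t) + 0.5 * rng.normal(size=t.size)
freqs = np.linspace(1.0, 10.0, 91)

power_raw = classical_lomb_scargle(t, y, freqs)
power_centered = classical_lomb_scargle(t, y - y.mean(), freqs)

print("peak with offset:     ", freqs[np.argmax(power_raw)])
print("peak after centering: ", freqs[np.argmax(power_centered)])
```

With the offset left in, the constant term can leak into every frequency bin, so the raw spectrum need not peak at the injected 5 Hz. After centering, the peak should sit at the injected frequency.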


In [None]:
# times_irr, counts_irr
# the variance does not depend on frequency, so compute it once outside the loop
variance = np.var(counts_irr)

for i, freq in enumerate(frequencies_mine):
    omega = 2.0*np.pi*freq

    # calculate tau, the time offset for this frequency
    numer = (np.sin(2.0*omega*times_irr)).sum()
    denom = (np.cos(2.0*omega*times_irr)).sum()
    tau = np.arctan2(numer,denom)/(2.0*omega)

    # cosine and sine evaluated at the shifted times
    cos_term = np.cos(omega*(times_irr-tau))
    sin_term = np.sin(omega*(times_irr-tau))

    power_mine[i] = (0.5/variance)*( (counts_irr*cos_term).sum()**2 / (cos_term**2).sum()
                                   + (counts_irr*sin_term).sum()**2 / (sin_term**2).sum() )


In [None]:
plt.plot(frequencies_mine,power_mine)
plt.xlabel('frequency [Hz]')
plt.ylabel('Lomb-Scargle power')
#plt.xlim(0,50)
#plt.ylim(0.0,2000)

### Now check your results!  

[Astropy](http://www.astropy.org/) is a Python package that contains a wide variety of useful astronomy-related Python tools.  It happens to include an [implementation of the Lomb-Scargle periodogram](http://docs.astropy.org/en/stable/stats/lombscargle.html) that is probably more efficient than the one we are writing in class, and it's also a good sanity check on your results.  You may need to update astropy, as the version that came with the Anaconda distribution we installed at the beginning of the semester is a bit old.  To check what version you have, do the following:

```
import astropy
print(astropy.version.version)
```

If you do not have at least version 1.3 of astropy, you need to upgrade astropy.  At the command line, type:

```
pip install astropy --upgrade
```

You may have to quit and restart your Jupyter notebook server to ensure that you have access to the new version.

**Once you have upgraded,** use the astropy Lomb-Scargle module to check the outputs from your own routine!

In [None]:
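# LombScargle lives in astropy.timeseries in newer astropy releases; the
# astropy.stats location is the older one, so fall back to it if needed
try:
    from astropy.timeseries import LombScargle
except ImportError:
    from astropy.stats import LombScargle

# let astropy choose its own frequency grid
frequency, power = LombScargle(times_irr, counts_irr).autopower()

# astropy in red, our implementation in blue, each normalized to its own maximum
plt.plot(frequency,power/power.max(),'r-',alpha=0.5)
plt.plot(frequencies_mine,power_mine/power_mine.max(),'b-',alpha=0.5)
plt.xlim(20,30)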
