## Combine pre and post pairing data with the pairing data before PCA

When doing PCA, it is useful to concatenate the pre-pairing, pairing, and post-pairing data into one large matrix ordered in time: pre-pairing, during pairing, then post-pairing. This way, one can evaluate the change in stimulation amplitude through the whole session.
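
As a minimal sketch of this concatenation, the toy arrays below stand in for the real volumes. Their names, shapes, and values are assumptions made for illustration. Each array has shape (numx, numy, numz, numt), and the three epochs are joined along the last (time) axis after NaNs are replaced with zeros.

```python
import numpy as np

# Hypothetical toy volumes with shape (numx, numy, numz, numt).
# The number of timepoints may differ between epochs.
pre = np.ones((2, 2, 1, 3))
during = 2 * np.ones((2, 2, 1, 4))
post = 3 * np.ones((2, 2, 1, 5))
pre[0, 0, 0, 0] = np.nan  # a missing sample; nan_to_num turns it into 0

# Join the epochs in temporal order along the time axis (axis=3)
combined = np.concatenate((np.nan_to_num(pre),
                           np.nan_to_num(during),
                           np.nan_to_num(post)), axis=3)
print(combined.shape)
```

Only the time axis may differ in length between the three arrays; the spatial dimensions must match for the concatenation to succeed.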

In [None]:
def combine_pre_during_post_data():
    if not os.path.isfile(os.path.join(working_directory, 'pawstimpairingresults','mean_data_for_condition_ALL.hdf5')):
        
        prepostdata = h5py.File(os.path.join(working_directory, 'pawstimresults','mean_data_for_condition.hdf5'),
                                'r')
        duringdata = h5py.File(os.path.join(working_directory, 'pawstimpairingresults','mean_data_for_condition.hdf5'),
                                'r')

        alldata = h5py.File(os.path.join(working_directory, 'pawstimpairingresults','mean_data_for_condition_ALL.hdf5'),
                                'a')
        for condition in duringdata.keys():
            temp = np.concatenate((np.nan_to_num(prepostdata[condition]['pre']),
                                   np.nan_to_num(duringdata[condition]),
                                   np.nan_to_num(prepostdata[condition]['post'])),axis=3)
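            print(temp.shape)
            alldata.create_dataset(condition, data=temp)
        alldata.close()
        # Close the input files as well so no handles are left open
        prepostdata.close()
        duringdata.close()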
    alldata = h5py.File(os.path.join(working_directory, 'pawstimpairingresults','mean_data_for_condition_ALL.hdf5'),
                                'r')
    return alldata

In [None]:
alldata = combine_pre_during_post_data()

In [None]:
fig = plot_traces_for_each_voxel(alldata, indices_for_windows, numvoxelstoshow=50, periodofinterest='poststim')

At this point, you can save the figure, if you want, to the path set below.

In [None]:
numvoxelstoshow = 50
sortby = 'Stimresponse_stimperiod'
path_to_save = os.path.join(results_directory,'Top%dvoxels_pawstim_sortedby%s_ordering'%(numvoxelstoshow,
                                                                                                sortby))
fig.savefig(path_to_save + '.png', format='png', dpi=300)
#fig.savefig(path_to_save + '.pdf', format='pdf')

In [None]:
reset_selective fig

## Convert data into PCA space

Now, we will do PCA on the data to get an unbiased factorization of the data. The PCA is done to compress the dimensionality of the voxels. This first set of functions is useful for performing the PCA and visualizing the raw results. Additional functions for advanced visualization will be defined later.
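
The following is a rough, self-contained sketch of the idea at toy scale. It is not the notebook's implementation: it uses a plain NumPy SVD as a stand-in for the sklearn `PCA` object, and the data, shapes, and variable names are made up for illustration. The 0.8 threshold mirrors the default `min_variance_explained` used in the functions of this section.

```python
import numpy as np

rng = np.random.RandomState(0)
numx, numy, numz, numt = 4, 3, 2, 50

# Hypothetical data: one shared time course scaled per voxel, plus small noise
t = np.linspace(0, 1, numt)
weights = rng.randn(numx, numy, numz)
data = (weights[..., None] * np.sin(2 * np.pi * t)
        + 0.01 * rng.randn(numx, numy, numz, numt))

# Flatten the spatial dimensions: (numx, numy, numz, numt) -> (voxels, numt)
flattened = data.reshape(-1, numt)

# Treat each timepoint as a sample and each voxel as a feature
samples = flattened.T
centered = samples - samples.mean(axis=0)

# PCA via SVD: rows of Vt are the PC vectors over voxels
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained_ratio = S ** 2 / np.sum(S ** 2)

# Keep the smallest number of PCs whose cumulative variance reaches 0.8
n_components = int(np.searchsorted(np.cumsum(explained_ratio), 0.8) + 1)

# Project into PCA space and transpose to (n_components, numt)
compressed = np.dot(centered, Vt[:n_components].T).T
print(flattened.shape, n_components, compressed.shape)
```

Because the toy data is dominated by a single time course, one component is enough to pass the threshold here; real data will generally need more.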

In [None]:
def preprocess_data(data):
    # This function flattens the data along the spatial dimensions. So size of data will
    # change from (numx, numy, numz, numt) to (numx*numy*numz, numt)
    
    #Parameters:
    #   1. The data to be flattened
    
    # Returns: Flattened data
    flattened_data = np.reshape(data,(np.prod(data.shape[:-1]), data.shape[3]))
    return flattened_data

def PCA_decomp(data, 
               pca_results_path, 
               indices_for_windows,
               min_variance_explained=0.8):
    # This function is called to do PCA decomposition. This checks if the PCA has already
    # been done by checking if a pickled file exists in pca_results_path. If this file
    # exists, it just loads those results. Otherwise, it performs the PCA decomposition.
    
    #Parameters:
    #   1. The data to perform PCA on. Shape: (numx, numy, numz, numt)
    #   2. pca_results_path: Path to where the PCA result would have been stored
    #                        if PCA has already been performed for this data.
    #   3. indices_for_windows: Window indices, used to standardize the sign
    #                           of each PC (see standardize_pca_sign).
    #   4. The minimum amount of variance that should be explained. The number of 
    #      PCs stored will be determined by this number.
    
    # Returns: Data in PCA space
    
    if os.path.isfile(pca_results_path): #has PCA already been done?
        return transform_from_loadedpca(data, pca_results_path)
    else:
        return perform_PCA_decomp(data, pca_results_path, indices_for_windows, min_variance_explained)

def transform_from_loadedpca(data, 
                             pca_results_path):
    # This function transforms inputted data based on stored PCA results for that data.
    
    #Parameters:
    #   1. The data on which PCA was performed.
    #   2. pca_results_path: Path to where the PCA result would have been stored
    #                        if PCA has already been performed for this data.
    
    # Returns: Data in PCA space
    transformed_data_path = os.path.join(results_directory, 'transformed_data.h5')
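    if not os.path.isfile(transformed_data_path):
        # Create an empty file and close it right away so that it can be
        # reopened below without leaving a second handle open.
        h5py.File(transformed_data_path,'x').close()
    transformed_data_handle = h5py.File(transformed_data_path,'r')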
    if pca_results_path in transformed_data_handle:
        return transformed_data_handle[pca_results_path]
    else:
        transformed_data_handle.close()
        transformed_data_handle = h5py.File(transformed_data_path,'a')
        return perform_transformation(data, pca_results_path, transformed_data_handle)    
    
def perform_transformation(data,
                           pca_results_path,
                           transformed_data_handle):
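    # This function transforms the data with the stored PCA, caches the result in
    # the open HDF5 file under the key pca_results_path, and closes the file.
    
    # Returns: Data in PCA space
    flattened_data = preprocess_data(data)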
    pca = load_calculated_pca(pca_results_path)
    compressed_data = pca.transform(flattened_data.T).T #transform back to shape n_components x n_timepoints
    transformed_data_handle.create_dataset(pca_results_path, data=compressed_data)
    transformed_data_handle.close()
    return compressed_data

def perform_PCA_decomp(data, 
                       pca_results_path, 
                       indices_for_windows,
                       min_variance_explained=0.8):
    # This function performs the PCA decomposition. This is called only if 
    # there aren't any results from a previous run stored in pca_results_path.
    
    #Parameters:
    #   1. The data to perform PCA on. Shape: (numx, numy, numz, numt)
    #   2. pca_results_path: Path to where the PCA result should be stored.
    #   3. indices_for_windows: Window indices, used to standardize the sign
    #                           of each PC (see standardize_pca_sign).
    #   4. The minimum amount of variance that should be explained. The number of 
    #      PCs stored will be determined by this number.
    
    # Returns: Data in PCA space
    
    flattened_data = preprocess_data(data)
    pca = PCA(n_components=min_variance_explained)
    pca.fit(flattened_data.T) 
    compressed_data = pca.transform(flattened_data.T).T #transform back to shape n_components x n_timepoints
    pca, compressed_data = standardize_pca_sign(pca, compressed_data, indices_for_windows)
    joblib.dump(pca, pca_results_path)
    return compressed_data

def load_calculated_pca(pca_results_path):
    # This function loads the calculated PCA object.
    
    #Parameters:
    #   1. pca_results_path: Path to where the PCA result would have been stored
    #                        if PCA has already been performed for this data.
    
    # Returns: sklearn PCA object
    pca = joblib.load(pca_results_path)
    return pca

def standardize_pca_sign(pca, 
                         compressed_data,
                         indices_for_windows,
                         criterion='positivestimresponse'):
    # The PCs are only defined up to a sign flip, i.e. a 180 degree rotated PC vector
    # is equivalently a PC vector. This function prevents this ambiguity by enforcing
    # the sign of each PC to be such that the derivative of the trace is positive at the 
    # onset of the first stimulation. Other forms of standardization could also be used.  
    # Obviously, this works only for this current experiment with stimulation. In case
    # there is no stimulation, you could set the criterion to "positiveslope" in which 
    # case the function will ensure that the PC's trace has a positive linear trend 
    # through the recording.
    
    #Parameters:
    #   1. The sklearn PCA object.
    #   2. compressed_data: Data in PCA space
    #   3. indices for windows. Explained above
    #   4. criterion for standardization. Set to positivestimresponse to ensure that
    #      the PCs have a positive stimulation response.
    
    # Returns: Sign standardized input parameters
    for pc in range(compressed_data.shape[0]):
        trace = compressed_data[pc,:].T
        if criterion=='positivestimresponse':
            if np.diff(trace)[indices_for_windows[1]]<0:
                compressed_data[pc,:] = -compressed_data[pc,:]
                pca.components_[pc,:] = -pca.components_[pc,:]  
        elif criterion=='positiveslope':
            time = np.arange(trace.shape[0]).T
            time = sm.add_constant(time)
            lm = sm.OLS(trace, time).fit()
            if lm.params[1] < 0: #if slope < 0, flip the PC vector and the trace
                compressed_data[pc,:] = -compressed_data[pc,:]
                pca.components_[pc,:] = -pca.components_[pc,:]
    return pca, compressed_data

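def extract_pc_vectors(pca_results_path, 
                       spatial_shape):
    # This function is used to extract the PC vectors from the stored results 
    # in pca_results_path and reshapes them to the original voxel tiling.
    
    #Parameters:
    #   1. PCA results path
    #   2. spatial_shape: Spatial shape of the data, (numx, numy, numz)
    
    # Returns: pca_vectors
    
    # Unpack inside the body; tuple parameters in a def only work in Python 2
    numx, numy, numz = spatial_shape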
    pca = load_calculated_pca(pca_results_path)
    pca_vectors = np.reshape(pca.components_.T, (numx,numy,numz,pca.components_.shape[0]))
    return pca_vectors

def plot_variance_explained_per_pc(pca_results_path,
                                   fig=None,
                                   ax=None,
                                   label=''):
    # This function plots the % of variance explained by each PC.
    
    #Parameters:
    #   1. PCA results path
    #   2. Figure handle. Optional. Useful if you want to layer plots
    #      across all conditions
    #   3. Axis handle. Optional. Same as above.
    #   4. Label for the plot. Will be set to a condition when called later.
    
    # Returns: Figure and axis handle to the plot   
    pca = load_calculated_pca(pca_results_path)
    if fig is None or ax is None:
        fig,ax = plt.subplots()
    ax.plot(100*pca.explained_variance_ratio_, '.-', label=label)
    ax.set_ylabel('% of variance explained')
    ax.set_xlabel('PC number')
    return fig, ax

def plot_pc_vectors(pca_results_path,
                    spatial_shape,
                    pc_of_interest):
    # This function plots the PC vector for pc_of_interest based on
    # the results stored in pca_results_path.
    # It returns a figure handle for this plot. One could then iterate over all PCs
    # to save the figures to your path of choice.
    
    #Parameters:
    #   1. PCA results path
    #   2. spatial_shape: Spatial shape of the data, (numx, numy, numz)
    #   3. pc_of_interest
    
    # Returns: Figure handle to the plot.
    
    # Unpack inside the body; tuple parameters in a def only work in Python 2
    numx, numy, numz = spatial_shape
    pca_vectors = extract_pc_vectors(pca_results_path, (numx, numy, numz))
    
    fig, axs = plt.subplots(3, 4)
    vmax = np.amax(pca_vectors[:,:,:,pc_of_interest])
    vmin = np.amin(pca_vectors[:,:,:,pc_of_interest])
    vmaxsymmetric = np.maximum(np.abs(vmax),np.abs(vmin))
    vminsymmetric = -np.maximum(np.abs(vmax),np.abs(vmin))

    temp = np.swapaxes(pca_vectors,0,1) #For making the plot
    for ax, zplane in zip(axs.flat, range(0,numz)):
        ax.matshow(temp[:,:,zplane,pc_of_interest],
                   vmin=vminsymmetric,
                   vmax=vmaxsymmetric,
                   cmap=plt.get_cmap('seismic'))
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
    fig.tight_layout()
        
    return fig

        
def plot_pc_traces(data_in_pcaspace, pc_of_interest):  
    # This function plots the PC trace associated with pc_of_interest
    # given the data in pca space.
    # Calling PCA_decomp() returns the data in pca space. 
    # This function calculates a z-score of all the traces.
    # So each trace is normalized within itself.
    # Hence, note that this function wouldn't be appropriate to compare two traces
    # since their magnitudes are normalized within themselves, rather than between
    # them.
    
    #Parameters:
    #   1. Data in PCA space
    #   2. pc_of_interest: Index of the PC whose trace should be plotted
    
    # Returns: figure handle to the plot.
    
    temp = (data_in_pcaspace[pc_of_interest,:])
    baseline = np.mean(temp[:indices_for_windows[1]])
    ztrace = (temp-np.mean(temp))/np.std(temp)#-np.log(temp/baseline) #=R2* multiplied by TE
    fig, ax = plt.subplots()
    sns.tsplot(ztrace, ax=ax)
    ax.set_ylabel('z-score PC signal (score)')
    ax.set_xlabel('Time (s)')
    fig.tight_layout()
        
    return fig
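
To illustrate the sign ambiguity that `standardize_pca_sign` resolves, here is a small sketch on random data. It again uses a NumPy SVD in place of the sklearn object, and the onset index is a hypothetical value. It shows that flipping a PC vector together with its trace leaves the reconstructed data unchanged, so the sign can be chosen freely by a convention such as a positive slope at stimulation onset.

```python
import numpy as np

rng = np.random.RandomState(1)
samples = rng.randn(20, 5)                  # (timepoints, voxels), toy data
centered = samples - samples.mean(axis=0)

U, S, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt.copy()                      # (n_components, voxels)
scores = np.dot(centered, components.T).T   # (n_components, timepoints)

onset = 3  # hypothetical index of the first stimulation onset

flipped_components = components.copy()
flipped_scores = scores.copy()
for pc in range(scores.shape[0]):
    # Enforce a positive derivative of the trace at the onset index
    if np.diff(scores[pc])[onset] < 0:
        flipped_scores[pc] *= -1
        flipped_components[pc] *= -1

# Both sign conventions reconstruct the same (centered) data
recon_before = np.dot(scores.T, components)
recon_after = np.dot(flipped_scores.T, flipped_components)
print(np.allclose(recon_before, recon_after))
```

The key point is that the trace and the PC vector must always be flipped together; flipping only one of them would change the reconstruction.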

Compute the data on which PCA will be applied. In our case, we will perform PCA on the mean data across all animals and runs for both conditions.

In [None]:
(numx, numy, numz, numt) = calculate_shape_of_data(f)
path_to_meandata_ALL = os.path.join(results_directory,'mean_data_for_condition_ALL.hdf5')
mean_data_for_condition = calculate_meansignal_across_animals(f, (numx,numy,numz,numt), path_to_meandata_ALL,
                                                             MIONdata, indices_for_windows)

Now specify the paths where you would like to store the results of the PCA. This set of paths will also be useful if you just want to load the calculated PCA results later.

In [None]:
pca_results_paths = {}
for condition in f.keys():
    pca_results_paths[condition] = os.path.join(results_directory,condition,'pca_results.pkl')
    mkdir_p(os.path.dirname(pca_results_paths[condition]))

Now we will perform the PCA for all conditions

In [None]:
conditions = sorted(pca_results_paths.keys())
for condition in conditions:
    PCA_decomp(mean_data_for_condition[condition], pca_results_paths[condition], indices_for_windows)

Now that the PCA has been performed for all conditions, the most important thing to check is how much variance each successive principal component explains. If a few PCs explain a much higher fraction of the variance than the rest, these are going to be the important PCs. The trailing PCs, which each explain a roughly equal fraction of the variance, are likely noise-related.
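
As a small sketch of this check, the ratios below are made-up numbers, not results from this dataset. In practice they would come from the `explained_variance_ratio_` attribute of the fitted sklearn PCA object. The cumulative sum shows how many leading PCs are needed to reach a given fraction of the variance.

```python
import numpy as np

# Hypothetical explained-variance ratios, one entry per PC
ratios = np.array([0.55, 0.20, 0.08, 0.03, 0.03, 0.03, 0.03, 0.03, 0.02])

cumulative = np.cumsum(ratios)

# Smallest number of leading PCs whose cumulative variance reaches 80%
n_needed = int(np.argmax(cumulative >= 0.8)) + 1
print(n_needed)
```

In this toy spectrum the first few PCs stand out, while the flat tail of nearly equal ratios is the pattern one would associate with noise.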

We will now plot the percentage of explained variance per PC for all conditions and layer them on top of each other. This can be saved if you choose.


In [None]:
fig, ax = plt.subplots()
for condition in conditions:
    plot_variance_explained_per_pc(pca_results_paths[condition],fig,ax, label=condition)

ax.legend(loc='upper right')
fig.show()

Save the above figure if needed

In [None]:
figfile = os.path.join(results_directory,'Percent_variance_explained_by_pcs')
fig.savefig(figfile+'.png',format='png',dpi=300)

The next important thing to check is how the traces of activation in the PCA space look. If none of the PCs contain a strong stimulation-evoked response, it is likely that there is no stimulation effect. Further, from the above plot of the fraction of variance explained, one can see that only the first 3 or so PCs are likely to represent anything interesting. Whether or not this representation is stimulation-related is something that we will check by plotting the traces below.

It is a good idea not to keep all the figure handles around when plotting, since this uses a lot of memory and may crash your computer. If you wish to keep the figures, save each one to a directory immediately so that you can scroll through them later. Once the figures are saved, clear their handles to free memory.
