**Purpose:** Investigate synchrony between the cell division and metabolic cycles.

This is inspired by Fig. 1 of Bieler et al. (2014), shown below, which shows the relationship between the cell division and circadian cycles.

![title](msb0010-0739-f1.jpg)

Figure 1. Circadian and cell cycle oscillators are tightly synchronized in NIH3T3 cells
- A. Single-cell time traces showing the circadian YFP signal (black, identified maxima in blue denoted as p), together with cell division events (nuclear envelope breakdown, red, denoted as d). The top trace is typical and shows three divisions before the circadian peaks, the second trace shows an early first division.
- B. Raster plot showing 3,160 traces (with at least two circadian peaks) aligned on the second circadian peak (blue), and sorted according to the interval between the first and second circadian peaks. Divisions (red) show a clear tendency to occur, on average, 5 h before the circadian peaks. A sparse group of early division events associated usually with longer circadian intervals is also visible.
- C. Division times measured with respect to the subsequent circadian peak show a unimodal distribution centered at −5 h. Inset: longer circadian intervals correlate with mitosis occurring, on average, closer to the next peak (also visible in B).
- D. Circadian phases at division (normalized division times) show a unimodal distribution. Inset: longer circadian intervals correlate with mitosis occurring at later circadian phases.

**Aims:**
- Import flavin signals and birth events from a population of same-strain cells subject to the same nutrient conditions.
    - Start from the more reliable BY4741 or FY4 time series first, to iron out any issues.
    - The real purpose is to do this with the mutants, especially `swe1_Del` from the Causton lab.  That should only be a matter of switching the dataset file.
- Process data: cut time series to duration of interest (births matrix should be cut too), detrend flavin signals.
- Align time series to the first birth event.
- Locate the subsequent metabolic cycle.
    - This may be through finding a peak, using the appropriate tolerance.
    - Fitting an autoregressive model or Gaussian process may be helpful.
    - I will also need to identify the troughs so that I only get one metabolic cycle (see heatmap in next main item).
    - I expect this to be the most technically difficult.  If all else fails, plotting a heatmap with birth times overlaid 'for now' may still help produce insight.
- Plot a heatmap that shows:
    - Where the second birth event is in relation to the first birth event.
    - Where the subsequent metabolic cycle is in relation to the first metabolic cycle.
- Plot histograms that show the distributions of:
    - Time differences between first and second birth events.
    - Time differences between first birth event and peak of metabolic cycle.

**Paradigms:**
- Use `aliby` data structures, i.e. `pandas` `DataFrames` with multi-indexing.
    - Births were only recently (2022-01-18) added to the `DataFrame` output.  Need to look at how it is done there first, and try to replicate.
- Use `postprocessor` processes e.g. `fft` (Fourier transform), `autoreg` (autoregressive model -- add methods as appropriate).
- The ultimate goal is to combine all the cells into a script that goes in `skeletons`, but some people prioritise having plots fast over having clean code.

In [None]:
import PyQt5
%matplotlib qt

**When going through this again: if I repeat lines of code, put them in a function to make life easier.**

# Import data

**Important note: must be mindful of how I'm dealing with TWO sets of signals -- flavin and births.  Current code, copied from `svm_sandbox.ipynb`, deals with ONE set of signals; having two sets is another can of worms.  Name the data variables and manage them wisely.**

In [None]:
import numpy as np
import pandas as pd
import csv

# PARAMETERS
#filename_prefix = './data/arin/Omero19979_'
filename_prefix = './data/arin/Omero20016_'
#

# Import flavin signals
signal_flavin = pd.read_csv(filename_prefix+'flavin.csv')
signal_flavin.replace(0, np.nan, inplace=True) # because the CSV is constructed like that :/

# Import birth signals
signal_births = pd.read_csv(filename_prefix+'births.csv')
#signal_births.replace(0, np.nan, inplace=True)

def convert_df_to_aliby(
    signal,
    strainlookup_df,
):
    # Import look-up table for strains (would prefer to directly CSV -> dict)
    strainlookup_dict = dict(zip(strainlookup_df.position, strainlookup_df.strain))
    
    # Positions -> Strain (more informative)
    signal = signal.replace({'position': strainlookup_dict})
    signal.rename(columns = {"position": "strain"}, inplace = True)
    signal = signal.drop(['distfromcentre'], axis = 1)

    # Convert to multi-index dataframe
    signal_temp = signal.iloc[:,2:]
    multiindex = pd.MultiIndex.from_frame(signal[['strain', 'cellID']])
    signal = pd.DataFrame(signal_temp.to_numpy(),
                          index = multiindex)
    
    return signal

strainlookup_df = pd.read_csv(filename_prefix+'strains.csv')
signal_flavin = convert_df_to_aliby(signal_flavin, strainlookup_df)
signal_births = convert_df_to_aliby(signal_births, strainlookup_df)

In [None]:
signal_flavin

In [None]:
signal_births

# Choose a list of cells as working data

List strains

In [None]:
signal_flavin.index.get_level_values(0).unique().to_list()

Define `signal_flavin_wd` and `signal_births_wd` as working data

In [None]:
strain = 'by4741'

signal_flavin_wd = signal_flavin.loc[strain]
signal_births_wd = signal_births.loc[strain]

# Processing time series

## Range

Chop up time series according to `interval_start` and `interval_end`, then remove cells that have NaNs.  Print number of cells.

In [None]:
# PARAMETERS
interval_start = 25
interval_end = 168
#

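signal_flavin_processed = signal_flavin_wd.iloc[:, interval_start:interval_end].dropna()
signal_births_processed = signal_births_wd.iloc[:, interval_start:interval_end].dropna() # don't expect NaN here, but just for consistency

print(f'Number of cells (flavin): {len(signal_flavin_processed)}')
print(f'Number of cells (births): {len(signal_births_processed)}')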

# Note: number of rows from the two DataFrames may be different from now on.
# This is expected because the flavin DataFrame has NaNs but the births doesn't.
# This is okay because matching the two will be based on the 'cellID' index, which never changes.
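
The matching by `cellID` mentioned in the comments above can be sketched with toy data. This is a minimal illustration with made-up cell IDs and values (`toy_flavin` and `toy_births` are hypothetical names, not part of the dataset); it uses `Index.intersection` to keep only the cells present in both DataFrames.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): cell 2 has a NaN in its flavin trace
toy_flavin = pd.DataFrame(
    [[1.0, 1.2, 1.1], [0.9, np.nan, 1.0], [1.1, 1.0, 0.8]],
    index=pd.Index([1, 2, 3], name='cellID'),
)
toy_births = pd.DataFrame(
    [[0, 1, 0], [1, 0, 0], [0, 0, 1]],
    index=pd.Index([1, 2, 3], name='cellID'),
)

# Dropping NaNs removes cell 2 from the flavin DataFrame only
toy_flavin_clean = toy_flavin.dropna()
toy_births_clean = toy_births.dropna()

# Keep only the cells present in both DataFrames
common_cells = toy_flavin_clean.index.intersection(toy_births_clean.index)
toy_births_matched = toy_births_clean.loc[common_cells]

print(len(toy_flavin_clean), len(toy_births_clean), len(toy_births_matched))
```

The row counts differ before matching and agree afterwards, which is the situation described in the comments.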

## Detrend

Using sliding window

In [None]:
# POTENTIAL ISSUE: This removes some time points.  Need to find a way to re-align births so that it makes sense.
# I expect that there is a constant shift that is a function of the size of the sliding window.
# ----> seems like no re-aligning needed???  After a cursory look at some time series & where their peaks should be.

import matplotlib.pyplot as plt
import seaborn as sns

# PARAMETERS
window = 45
#

fig, ax = plt.subplots()
sns.heatmap(signal_flavin_processed)
plt.title('Before detrending')
plt.show()

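def moving_average(input_timeseries,
                  window = 3):
    # Moving average via cumulative sums.
    # Output has len(input_timeseries) - window + 1 points.
    processed_timeseries = np.cumsum(input_timeseries, dtype=float)
    processed_timeseries[window:] = processed_timeseries[window:] - processed_timeseries[:-window]
    return processed_timeseries[window - 1 :] /  window

# Normalise each cell by its mean over time
signal_flavin_processed = signal_flavin_processed.div(signal_flavin_processed.mean(axis = 1), axis = 0)
# Moving average of each cell (row)
signal_flavin_movavg = signal_flavin_processed.apply(lambda x: pd.Series(moving_average(x.values, window)), axis = 1)
# Divide the central part of each time series by its moving average.
# Note: -window//2 parses as (-window)//2, i.e. -23 for window = 45, so the slice keeps
# one fewer time point than the moving average has.  Hence the last moving-average
# column is dropped to make the shapes match.
signal_flavin_norm = signal_flavin_processed.iloc(axis = 1)[window//2: -window//2] / signal_flavin_movavg.iloc[:,0:signal_flavin_movavg.shape[1]-1].values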

fig, ax = plt.subplots()
sns.heatmap(signal_flavin_norm)
plt.title('After detrending')
plt.show()

signal_flavin_processed = signal_flavin_norm

# Re-align births because some time points were removed/shifted
# Doing it this way so that column headers are consistent with the (modified) flavin DataFrame
signal_births_processed = signal_births_processed.iloc(axis=1)[window//2: -window//2]
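
One way to check the re-alignment question raised above is to compare against a centred rolling mean, which keeps the original column labels so that the trimmed time points are explicit. This is a sketch on toy data, assuming an odd window (`toy_signal` and `toy_window` are hypothetical); it is an alternative to `moving_average()`, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 2 cells, 9 time points, linear trend plus oscillation
timepoints = np.arange(9)
toy_signal = pd.DataFrame(
    [1.0 + 0.1 * timepoints + 0.2 * np.sin(timepoints),
     2.0 + 0.05 * timepoints + 0.1 * np.cos(timepoints)],
    index=pd.Index([1, 2], name='cellID'),
)
toy_window = 3  # assumed odd, so the window has a well-defined centre

# Rolling works down the rows by default, so transpose to roll along time
toy_trend = toy_signal.T.rolling(toy_window, center=True).mean().T

# Edge time points have no full window and become NaN; drop those columns
toy_detrended = (toy_signal / toy_trend).dropna(axis='columns')

# The column labels show which time points survive
print(toy_detrended.columns.to_list())
```

In this sketch `toy_window // 2` time points are lost at each end, and a time point labelled `t` keeps the label `t`, so a births DataFrame could be trimmed by selecting the same column labels.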

## Align to first birth

Save dataframes (I plan to remove this step later...)

In [None]:
signal_flavin_save = signal_flavin_processed
signal_births_save = signal_births_processed

Align

In [None]:
signal_births_aligned = signal_births_save
signal_flavin_aligned = signal_flavin_save

births_shifted_rows = []
flavin_shifted_rows = []
# Match flavin and birth signals by cellID (shared and not changed)
for cellID in signal_flavin_aligned.index:
    # Identify first birth and define shift
    birth_locs = np.where(signal_births_processed.loc[cellID].to_numpy() == 1)[0]
    # Check size rather than .any(): .any() is False if the only birth is at index 0
    if birth_locs.size > 0:
        shift = birth_locs[0]
    else:
        shift = 0
    # Make first birth the first time point
    # When shifted, the rest of the df is NaNs
    births_shifted_rows.append(
        signal_births_aligned.loc[[cellID]].shift(periods = -shift, axis = 'columns')
    )
    # Shift flavin signals accordingly
    flavin_shifted_rows.append(
        signal_flavin_aligned.loc[[cellID]].shift(periods = -shift, axis = 'columns')
    )
# Re-construct dataframes
signal_births_aligned = pd.concat(births_shifted_rows, ignore_index = True)
signal_births_aligned.set_index(signal_flavin_save.index, inplace = True) # note FLAVIN here -- needed to match dimensions

signal_flavin_aligned = pd.concat(flavin_shifted_rows, ignore_index = True)
signal_flavin_aligned.set_index(signal_flavin_save.index, inplace = True)
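
The row-wise shift used above can be illustrated on a single toy cell. This is a minimal sketch with made-up values (`toy_births` is hypothetical): `DataFrame.shift` with a negative `periods` moves values to the left and fills the vacated right-hand columns with NaN.

```python
import numpy as np
import pandas as pd

# Toy births for one cell (hypothetical): first birth at time point 2
toy_births = pd.DataFrame(
    [[0.0, 0.0, 1.0, 0.0, 1.0]],
    index=pd.Index([7], name='cellID'),
)

# Location of the first birth; fall back to no shift if there are no births
toy_birth_locs = np.where(toy_births.loc[7].to_numpy() == 1)[0]
toy_shift = int(toy_birth_locs[0]) if toy_birth_locs.size > 0 else 0

# Shift left so that the first birth lands on the first column
toy_aligned = toy_births.shift(periods=-toy_shift, axis='columns')
print(toy_aligned)
```

The second birth, originally at time point 4, ends up at column 2, i.e. its position is now the interval from the first birth.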

Remove cells without births

In [None]:
signal_births_drop = signal_births_processed.loc[signal_births_processed.sum(axis = 1) != 0, :]

signal_births_aligned = signal_births_aligned.loc[
    signal_births_drop.index.intersection(signal_births_aligned.index)
]
signal_flavin_aligned = signal_flavin_aligned.loc[
    signal_births_drop.index.intersection(signal_flavin_aligned.index)
]

Arrange cells by interval between first and second birth

In [None]:
# Find intervals between first and second births
second_births_intervals = []
for cellID in signal_births_aligned.index:
    birth_locs = np.where(signal_births_aligned.loc[cellID].to_numpy() == 1)[0]
    birth_locs = birth_locs[1:] # drop the first birth, which is at time point zero after alignment
    if birth_locs.size > 0:
        second_births_intervals.append(birth_locs[0]) # this is when the 2nd birth is in relation to the 1st
    else:
        second_births_intervals.append(None)
# Absence of second birth represented by nan
second_births_intervals = np.array(second_births_intervals, dtype=float)

# Rearrange order of rows in dataframes according to time of second birth
def rearrange_by_sorted_list(
    df,
    my_list,
):
    return df.reindex(df.index[np.argsort(my_list)].to_list())

signal_births_aligned = rearrange_by_sorted_list(signal_births_aligned, second_births_intervals)
signal_flavin_aligned = rearrange_by_sorted_list(signal_flavin_aligned, second_births_intervals)
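
The sorting step relies on `np.argsort` placing NaN values last, so cells without a second birth end up at the bottom. A small sketch with made-up intervals and cell IDs (`toy_intervals` and `toy_df` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy intervals (hypothetical): the cell with ID 20 has no second birth
toy_intervals = np.array([12.0, np.nan, 8.0])
toy_df = pd.DataFrame(
    [[1, 0], [1, 0], [1, 1]],
    index=pd.Index([10, 20, 30], name='cellID'),
)

# np.argsort sorts NaN to the end, so cells without a second birth go last
toy_order = np.argsort(toy_intervals)
toy_sorted = toy_df.reindex(toy_df.index[toy_order].to_list())
print(toy_sorted.index.to_list())
```

Note that `reindex` needs unique index labels, so this approach assumes each `cellID` appears once in the DataFrame.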

# Locate metabolic cycle peaks

In [None]:
# (add code here)
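
As a starting point for this section, a very simple peak finder could mark local maxima above a height threshold. This is only a sketch on a synthetic sine wave, not the method this notebook settles on; `find_local_maxima` and `min_height` are hypothetical names. `scipy.signal.find_peaks` is a fuller alternative that supports prominence and distance criteria.

```python
import numpy as np

def find_local_maxima(timeseries, min_height=0.0):
    """Return indices of points higher than both neighbours and above min_height.

    Hypothetical helper for illustration only.
    """
    timeseries = np.asarray(timeseries, dtype=float)
    centre = timeseries[1:-1]
    is_peak = (
        (centre > timeseries[:-2])
        & (centre > timeseries[2:])
        & (centre > min_height)
    )
    # +1 because the first time point was excluded from `centre`
    return np.where(is_peak)[0] + 1

# Synthetic signal with a period of 20 time points
toy_time = np.arange(60)
toy_flavin = np.sin(2 * np.pi * toy_time / 20)

toy_peaks = find_local_maxima(toy_flavin, min_height=0.5)
print(toy_peaks)
```

On real, noisy flavin traces this would likely pick up spurious maxima, so some smoothing or a stricter criterion is probably needed before the first peak after alignment can be read off as `toy_peaks[0]`.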

# Plot heatmaps

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.preprocessing import StandardScaler

# Apparently matplotlib.pyplot.imshow() is better?
# https://stackoverflow.com/questions/44012488/overlay-two-heatmaps-in-seaborn-one-being-frames-around-cells-in-the-other-fro

# Produce masked array for births
births_array = signal_births_aligned.to_numpy()
births_heatmap_mask = np.ma.masked_where(births_array == 0, births_array)

# Scale flavin signals
flavin_array = signal_flavin_aligned.to_numpy()
scaler = StandardScaler().fit(flavin_array.transpose())
signal_flavin_aligned_scaled = scaler.transform(flavin_array.transpose())
signal_flavin_aligned_scaled = signal_flavin_aligned_scaled.transpose()

# Superimpose flavin and births heatmaps
flavin_heatmap = plt.imshow(
    signal_flavin_aligned_scaled,
    cmap = cm.RdBu_r,
)
births_heatmap = plt.imshow(
    births_heatmap_mask,
    interpolation='none',
)

# Labelling
plt.xlabel('Time point')
plt.ylabel('Cell')
plt.show()
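
The masking step above can be checked on a toy births array without plotting. A minimal sketch with made-up values (`toy_births` is hypothetical): `np.ma.masked_where` masks every entry where the condition holds, here every time point without a birth. Masked entries are, by default, not drawn by `imshow`, which is what lets the flavin heatmap show through.

```python
import numpy as np

# Toy births (hypothetical): 2 cells, 4 time points
toy_births = np.array([[1, 0, 0, 1],
                       [1, 0, 1, 0]])

# Mask all entries without a birth
toy_mask = np.ma.masked_where(toy_births == 0, toy_births)

print(toy_mask.mask)     # True where the entry is hidden
print(toy_mask.count())  # number of unmasked entries, i.e. births
```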

# Plot histograms

In [None]:
# - From birth signals: find duration between 1st and 2nd birth (should just be an easy subtraction problem)
# - From flavin signals: find duration between peak (which should already be found earlier)
#   and the corresponding 1st birth.
# - When these durations are found, it should be straightforward with matplotlib.pyplot.hist()

Duration between 1st and 2nd births

In [None]:
sampling_period = 5
binsize = 5

plt.hist(
    second_births_intervals * sampling_period,
    np.arange(0, binsize * (np.nanmax(second_births_intervals * sampling_period)//binsize + 2), binsize)
)
plt.xlabel('Time (min)')
plt.ylabel('Frequency')
plt.title('Distribution of duration between 1st and 2nd births')
plt.show()
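
The bin-edge formula in the histogram above can be checked numerically with `np.histogram`. This is a sketch with made-up intervals (all `toy_` names are hypothetical); NaNs are dropped explicitly here so the counts are unambiguous.

```python
import numpy as np

toy_sampling_period = 5  # minutes per time point, as in the cell above
toy_binsize = 5

# Toy intervals in time points (hypothetical); NaN means no second birth
toy_intervals = np.array([12.0, np.nan, 8.0, 12.0])
toy_durations = toy_intervals * toy_sampling_period
toy_durations = toy_durations[~np.isnan(toy_durations)]

# Same bin-edge formula as the histogram above
toy_edges = np.arange(
    0,
    toy_binsize * (np.max(toy_durations) // toy_binsize + 2),
    toy_binsize,
)
toy_counts, _ = np.histogram(toy_durations, bins=toy_edges)
print(toy_edges[-1], toy_counts.sum())
```

The `+ 2` in the formula makes the last edge lie one bin width beyond the longest duration, so the maximum always falls inside a bin.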