# Introduction

**Welcome to the Keypoint MoSeq Analysis Visualization Notebook!**

This notebook contains a collection of analyses we typically perform on behavioral recordings to better understand mouse behavior. Examples of such analyses are:
- adding experimental group labels to the session
- descriptive labeling of individual syllables
- comparing syllable usage across different experimental groups
- comparing syllable transition structure across different experimental groups

The goal of this notebook is to provide you with a basic summary of syllable usage and sequencing statistics across experimental conditions and to help generate more specific questions related to observations you may identify here.
If you want to perform more specific analyses, you can export the MoSeq data (see below) and use it with the tool you're most comfortable with (e.g., Excel, Python, R, or MATLAB).

In this notebook, the **Markdown** above each cell describes the purpose of the cell(s), the cell output, and the instructions for running the cell(s) and/or interacting with the widget. The **inline code comments** in each code block provide contextual information about the function, code structure, and parameters.

***

# Project setup

## Files and Directory Structure
To run this notebook, you need the following files and folders in your data directory:
- `progress.yaml` - this file stores all the required paths used throughout the notebooks. The first time you run this notebook, a `progress.yaml` file will be created for you. If you have already run this notebook before, you can load the `progress.yaml` file by running the cell below.
- `config.yml` - this file stores the configuration parameters for the pipeline.
- `index.yaml` - this file stores the metadata for each recording and the group labels for different experimental conditions. The first time you run the interactive group setting component, an `index.yaml` file will be created for you.
- `<model_dir>` - the folder that contains the model and the model results. By default, the folder is named with the timestamp at which the model was initialized. The folder should contain:
  - `crowd_movies`
  - `grid_movies`
  - `trajectory_plots`
  - `checkpoint.p`
  - `results.h5`

At this stage, your base directory should look something like what's shown below:
```
.
└── <project_dir>/
    ├── <model_dir>/
    │   ├── crowd_movies/
    │   ├── grid_movies/
    │   ├── trajectory_plots/
    │   ├── checkpoint.p
    │   └── results.h5
    ...
    └── progress.yaml
```
For more information about how MoSeq organizes data, check out our [wiki](https://github.com/dattalab/moseq2-app/wiki/Directory-Structures-and-yaml-Files-in-MoSeq-Pipeline).
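Before loading the progress file, it can help to confirm that the model folder contains the entries listed above. The following is a minimal sketch using only the standard library; `find_missing_entries` is a hypothetical helper written for this illustration and is not part of `keypoint_moseq`.

```python
from pathlib import Path

# Entries the list above says the model folder should contain.
EXPECTED_MODEL_ENTRIES = [
    "crowd_movies",
    "grid_movies",
    "trajectory_plots",
    "checkpoint.p",
    "results.h5",
]


def find_missing_entries(project_dir, model_dirname):
    """Return the expected model-folder entries that are not present on disk."""
    model_dir = Path(project_dir) / model_dirname
    return [name for name in EXPECTED_MODEL_ENTRIES if not (model_dir / name).exists()]


# Hypothetical paths: adjust them to your own project and model folder.
missing = find_missing_entries("./", "model_dir")
print("missing entries:", missing or "none")
```

An empty list means all five entries were found; anything else names what still needs to be generated or copied into the model folder.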

**Note: this notebook uses the `progress.yaml` file to keep track of all the necessary paths.** Please ensure you run the Load Progress cell below before running any analysis modules.

In [None]:
from keypoint_moseq.analysis import track_progress

project_dir = './' # project directory
model_dirname = 'model_dir' # model directory name for the model to analyze
input_dir = './videos' # input video directory path
progress_filename = 'progress.yaml' # progress file name

progress_paths = track_progress(model_dirname, project_dir, input_dir, progress_filename)

## Select a different model to analyze

In [None]:
from os.path import join
from keypoint_moseq.analysis import update_model_progress

model_dirname = model_dirname # replace the right-hand side with the name of the model directory to analyze

# update all relevant paths for the new model
progress_paths = update_model_progress(progress_paths, model_dirname, join(project_dir, 'progress.yaml'))

# Assign Groups

Sessions can be given group labels for analyses that compare different cohorts or experimental conditions. The labels are stored in the `index.yaml` file. You can find more about this widget in the wiki.

The following cell creates the `index.yaml` file if necessary, runs the widget to assign groups, and writes the group information to `index.yaml`.

**Instructions:**
- **Run the following cell**.
- **Click the column header** to sort the column and use the filter icon to filter if needed.
- **Click on the session** to select the session. **To select multiple sessions, click the sessions while holding down the [Ctrl]/[Command] key, or click the first and last entry while holding down the [Shift] key.**
- **Enter the group name in the `Desired Group Name` field** and click `Set Group` to update the `group` column for the selected sessions.
- Click the `Update Index File` button to save current group assignments.


In [None]:
from keypoint_moseq.analysis import interactive_group_setting
progress_paths = interactive_group_setting(progress_paths)

# Compute Syllable Statistics

## Compute `moseq_df`
The following cell generates a `DataFrame` of scalar values computed during the extraction step aligned to the model labels. All sessions are concatenated to form this `DataFrame`. Each row is one "frame" which corresponds to one sample from the original depth video. Each column is a different entry.
To view the entries of this `DataFrame`, run `print(moseq_df.columns)`. This `DataFrame` can be used to plot the scalar feature values for any session over time.

You will notice multiple columns with syllable labels, `syllable` and `syllable_reindexed`. The `syllable` column contains the syllable labels as they were extracted from the model. The `syllable_reindexed` column contains the syllable labels after they have been sorted by frequency and reindexed, so that lower indices correspond to more frequently used syllables. The `smooth_heading` argument controls whether the output heading is smoothed (`True`) or not (`False`).
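To make the reindexing concrete, here is a minimal sketch of sorting labels by frequency and relabeling them. It counts one occurrence per frame and breaks ties by the original label; both choices are assumptions made for illustration, and `reindex_by_frequency` is a hypothetical helper, not the function the library uses.

```python
from collections import Counter


def reindex_by_frequency(labels):
    """Relabel syllables so that 0 is the most frequent, 1 the next, and so on."""
    counts = Counter(labels)
    # Sort by descending count; break ties by the original label for determinism.
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {old: new for new, old in enumerate(ranked)}
    return [mapping[label] for label in labels], mapping


# Toy label sequence (made up): syllable 5 occurs most often, then 7, then 2.
original = [7, 7, 2, 7, 5, 2, 7, 5, 5, 5, 5]
reindexed, mapping = reindex_by_frequency(original)
print(mapping)    # {5: 0, 7: 1, 2: 2}
print(reindexed)  # [1, 1, 2, 1, 0, 2, 1, 0, 0, 0, 0]
```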

**Instructions:**
- **Run the following cell**.

In [None]:
from keypoint_moseq.analysis import compute_moseq_df

# compute session scalar data
moseq_df = compute_moseq_df(progress_paths, smooth_heading=True)
print('moseq_df size: ', moseq_df.shape[0], 'rows;', moseq_df.shape[1], 'columns')

# this line prints out the first 5 rows of the dataframe in table format. It is used to
# get a sense of what is contained in the DataFrame
moseq_df.head()

## Export `moseq_df`

You can export the `moseq_df` to a csv file (or other alternative file types) for further analysis using the following cell. [See here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for a list of alternative file formats to save `moseq_df`.

**Note:** it can take quite a while to export the `moseq_df`, especially if it is saved as a `.csv` file. Check out [pandas's user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) to see if other formats will work better for you.
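As one illustration of an alternative format, the sketch below writes a small stand-in `DataFrame` both as CSV and as a pickle file and reads each back. The toy data and file names are made up; pickle is only one of the options in the pandas guide, and whether it suits you depends on the tools you plan to use downstream.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for moseq_df (values are made up for illustration).
toy_df = pd.DataFrame({"syllable": [0, 1, 1], "velocity": [1.5, 2.5, 3.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_df.csv"
    pkl_path = Path(tmp_dir) / "toy_df.pkl"

    toy_df.to_csv(csv_path, index=False)  # plain text, readable by most tools
    toy_df.to_pickle(pkl_path)            # binary, Python-only, keeps dtypes

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv.shape, from_pkl.shape)
```

CSV is the most portable choice if you plan to continue in Excel, R, or MATLAB; a binary format is mainly useful when you stay within Python.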

**Instructions:**
- **Specify the folder** where you want to save the dataframe in `save_path`. By default, the file is saved to the model directory (`progress_paths['model_dir']`).
- **Run the following cell** to save `moseq_df` as a CSV file.

In [None]:
# Save `moseq_df` as a csv file
from os.path import join

# Specify the place you want to save the dataframe in `save_path`
save_path = progress_paths['model_dir']  # save the dataframe in the model specific folder

# exports the dataframe
filename = 'moseq_df.csv'
# here we use .to_csv to export the dataframe, but you can change it to try other ways of saving your data
moseq_df.to_csv(join(save_path, filename), index=False)

print('DataFrame is saved:', join(save_path, filename))

## Compute `stats_df`
`stats_df` is a `DataFrame` that contains statistical summaries (i.e., min, max, mean, std) of the scalar values (kinematic values such as heading and velocity) associated with each syllable, as well as the frequency with which each syllable is expressed. By default, it is computed from the features included in `moseq_df` for each session independently. This dataframe will be used to plot syllable statistics and perform hypothesis testing.

The function can group data into whichever categories you supply into the `groupby` parameter.
By default, we group by "group" (the experimental cohort), "uuid" (each unique recording session), and "file_name".
Under the hood, we use `pandas` to perform the `groupby` calculation.
To learn more about `pandas`'s `groupby` capabilities, check out [their documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#groupby) on the subject.
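The grouping idea can be sketched directly in `pandas` on a toy table. The data, the `velocity` column, and the reduced `groupby` list below are made up for illustration; this shows the general `groupby`/`agg` pattern, not the internals of the function used in this notebook.

```python
import pandas as pd

# Toy stand-in for moseq_df: one row per frame (values are made up).
toy_df = pd.DataFrame(
    {
        "group": ["ctrl", "ctrl", "ctrl", "drug", "drug", "drug"],
        "uuid": ["a", "a", "a", "b", "b", "b"],
        "syllable": [0, 0, 1, 0, 1, 1],
        "velocity": [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],
    }
)

groupby = ["group", "uuid"]

# One row per (group, uuid, syllable) with summary statistics of the scalar.
toy_stats = (
    toy_df.groupby(groupby + ["syllable"])["velocity"]
    .agg(["min", "max", "mean", "std"])
    .reset_index()
)
print(toy_stats)
```

Adding or removing names in `groupby` changes what one row of the result represents, which is the same effect the `groupby` parameter has on `stats_df`.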

By default, each row of the `stats_df` contains the average syllable usage for one syllable for one group (experimental cohort) within one uuid (session). Changing the contents of the `groupby` variable will change the contents of the `stats_df`.
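The usage normalization and usage threshold applied when computing `stats_df` can be illustrated with a small standard-library sketch. It assumes usage is counted per frame and that the threshold is compared against the fractional usage before rare syllables are dropped; `normalized_usage` is a hypothetical helper, and the library may differ in these details.

```python
from collections import Counter


def normalized_usage(labels, threshold=0.005, normalize=True):
    """Count syllable occurrences, optionally normalize, and drop rare syllables."""
    counts = Counter(labels)
    total = sum(counts.values())
    usage = {s: (c / total if normalize else c) for s, c in counts.items()}
    # Assumption: the threshold is compared against the fractional usage.
    return {s: u for s, u in usage.items() if counts[s] / total >= threshold}


# 1000 made-up frames from one session; syllable 2 is used in only 0.2% of them.
labels = [0] * 600 + [1] * 398 + [2] * 2
usage = normalized_usage(labels)
print(usage)
```

In this sketch, syllable 2 falls below the 0.005 threshold and is excluded, so the remaining usages sum to slightly less than 1.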

**Instructions:**
- **Run the following cell**.

In [None]:
# compute syllable usage and scalar statistics
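# assumption: `compute_stats_df` is imported from `keypoint_moseq.analysis`, like the other helpers in this notebook
from keypoint_moseq.analysis import compute_stats_df

syllable_key = 'syllables_reindexed'  # name of the `moseq_df` column that holds the syllable labels (check `moseq_df.columns`)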
groupby = ['group', 'uuid', 'file_name']  # can be any categorical variables
usage_normalization = True # syllable usages within one session add up to 1 when usage_normalization is True
threshold = 0.005  # threshold for syllable usage to include in the dataframe
fps = 30 # frame rate of the video

stats_df = compute_stats_df(moseq_df, threshold=threshold, groupby=groupby, fps=fps, syll_key=syllable_key, normalize=usage_normalization)
print('The shape of stats_df', stats_df.shape)
stats_df.head()

## Export `stats_df`

You can export the `stats_df` to a csv file for further analysis using the following cell.

**Instructions:**
- **Specify the folder** where you want to save the dataframe in `save_path`. By default, the file is saved to the model directory (`progress_paths['model_dir']`).
- **Run the following cell** to save `stats_df` as a CSV file.

In [None]:
# Save `stats_df` as a csv file
from os.path import join

# Specify the place you want to save the dataframe in `save_path`
save_path = progress_paths['model_dir']  # save the dataframe in the model specific folder

# exports the dataframe
filename = 'stats_df.csv'
# here we use .to_csv to export the dataframe, but you can change it to try other ways of saving your data
stats_df.to_csv(join(save_path, filename), index=False)

print('DataFrame is saved:', join(save_path, filename))

## Generate Behavioral Summary (Fingerprints)
Fingerprints summarize behavior by showing the distributions of scalars (e.g., position, velocity, height, and length) and syllables.
These plots are a useful diagnostic tool: they can reveal sessions where the mouse wasn't extracted properly or where it moved unusually much or little.
You can find more information in the wiki [here](https://github.com/dattalab/moseq2-app/wiki/MoSeq2-Analysis-Visualization-Notebook-Instructions#fingerprint-plots).

The following cell generates the summary dataframe and plots the behavioral summary. The fingerprint plot will be automatically saved as png and pdf in the `plots` folder in the model directory.

**Instructions:**
- **Set the `n_bins` variable** to an integer to specify the number of bins for the MoSeq scalar values. Set `n_bins` to `None` if you want the number of bins to match the number of syllables. `n_bins` does not bin syllables.
- **Set the `range_type` variable** to 'robust' to include data ranging from the 1st percentile to the 99th percentile, or to 'full' to include all the data.
- **Assign an `sklearn.preprocessing` object to the `preprocessor` variable** if you want to scale the values by session. If `preprocessor` is set to `None`, the figure will show the proportion of data filling each bin.
- **Run the following cell.**
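The difference between the 'robust' and 'full' settings of `range_type` can be sketched with the standard library. The example assumes evenly spaced bins and uses `statistics.quantiles` for the 1st and 99th percentiles; `bin_edges` is a hypothetical helper, and the plotting function may compute its ranges differently.

```python
import statistics


def bin_edges(values, n_bins, range_type="robust"):
    """Return n_bins + 1 evenly spaced edges over the robust or full data range."""
    if range_type == "robust":
        # With n=100, statistics.quantiles returns the 99 percentile cut points.
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        low, high = cuts[0], cuts[-1]  # 1st and 99th percentiles
    elif range_type == "full":
        low, high = min(values), max(values)
    else:
        raise ValueError("range_type must be 'robust' or 'full'")
    step = (high - low) / n_bins
    return [low + i * step for i in range(n_bins + 1)]


# Made-up scalar values: 0..100 plus one extreme outlier.
values = list(range(101)) + [1000]
full_edges = bin_edges(values, n_bins=4, range_type="full")
robust_edges = bin_edges(values, n_bins=4, range_type="robust")
print(full_edges)
print(robust_edges)
```

With the 'full' range, the single outlier stretches the bins so that nearly all of the data falls into the first one; the 'robust' range keeps the bins focused on where most of the data lies.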
