# MEDPC Data Tutorial

**About:** This tutorial shows how to load, tidy, summarize, and plot MED-PC rat gambling task data in Python.

**Contact:**
* Dexter Kim: dexterkim2000@gmail.com
* Brett Hathaway: bretthathaway@psych.ubc.ca

**Requirements**
* The data must be an Excel file from MEDPC2XL (trial-by-trial data)
* The data, the `rgt_functions.py` file, and this notebook must be in your current working directory

**Note: This tutorial is split into multiple sections**
* Section 1: Loading Data into Python 
* Section 2: Acquisition Analysis and Plotting\*
* Section 3: Latin Square Analysis and Plotting\*
* Section 4: Miscellaneous (optional)

\*Depending on your data, you should only need to complete one of Sections 2 and 3.

**Please run the following cell!**

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
#MEDPC rat gambling task functions imports, will print "I am being executed!" if functional
import rgt_functions as rgt

#main imports 
import os
import pandas as pd
import numpy as np

# plotting imports 
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# stats imports 
import scipy.stats as stats

#the following line stops pandas from raising unnecessary chained-assignment warnings (SettingWithCopyWarning)
pd.options.mode.chained_assignment = None

***
# 1) Load data into Python
* Assign the names of the files that you want to analyze to `file_names`
* `load_data()` returns a table similar to the Excel sheet(s) you loaded in (in the order established in `file_names`)
    * Note: `df` is short for dataframe, the object that stores the table containing your data
* Passing `reset_sessions = True` makes the session numbers start from 1 again (you may want to do this for baseline analysis)
* `load_multiple_data()` loads in multiple cohorts (with the same subject numbers) and assigns them unique subject numbers (ex. subject 1 of cohort 1 --> subject 101)
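
If you are curious what the subject renumbering looks like, here is a minimal sketch with plain pandas. It is not the actual `load_multiple_data()` code: the toy dataframes, the `Subject` and `Session` column names, and the "add 100 per cohort" rule are assumptions chosen to match the example above.

```python
import pandas as pd

# Two toy cohorts that reuse the same subject numbers (hypothetical columns).
cohort1 = pd.DataFrame({"Subject": [1, 2], "Session": [5, 5]})
cohort2 = pd.DataFrame({"Subject": [1, 2], "Session": [7, 7]})

combined = []
for cohort_number, cohort_df in enumerate([cohort1, cohort2], start=1):
    cohort_df = cohort_df.copy()
    # Subject 1 of cohort 1 becomes 101, subject 1 of cohort 2 becomes 201.
    cohort_df["Subject"] = cohort_df["Subject"] + 100 * cohort_number
    combined.append(cohort_df)

# Stack the cohorts into one dataframe with unique subject numbers.
df_all = pd.concat(combined, ignore_index=True)
print(df_all["Subject"].tolist())
```

After renumbering, every subject number identifies exactly one rat, so the cohorts can be analyzed together.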

In [None]:
file_names = ['BH06_raw_round1-infusions.xlsx', 'BH06_raw_round1-makeup.xlsx'] 

df = rgt.load_data(file_names)

#load_data does not display the dataframe by itself. Use head() to view the top of your dataframe. Note: it should look exactly the same as the top of your first Excel file. 

df.head()

***
# 2A) Acquisition Analysis Section (Analysis by Session)

**Please Note! If you are trying to do Latin Square Analysis, skip section 2A and 2B and finish section 3A and 3B**

Set your objects! These will be used in the rest of sections 2A and 2B. Examples are left in for clarity.
* Assign the names of the control and experimental groups **in order** to `group_names` as a list
* Assign the title of the figures to `title`
* Assign the range of sessions you want to analyze to `startsess` and `endsess`
* Assign the names of the control and experimental groups **in order** to `group_names_2` as a dictionary
* Assign the rat subject numbers to `exp_group` and `control_group`

In [None]:
group_names = ['Tg negative','Tg positive'] 

title = 'Nigrostriatal activation during acquisition' 
startsess = 29 #first session in this dataset
endsess = 30 #last session in this dataset

group_names_2 = {0: 'tg negative',
              1: 'tg positive'} 

exp_group = [1, 2, 7, 8, 11, 12, 16, 19, 20, 21, 22, 25, 26, 29, 32] #Tg positive

control_group = [3, 4, 5, 6, 9, 13, 14, 15, 17, 18, 23, 24, 27, 28, 30, 31] #Tg negative

**Check your session data**
* `check_sessions` gives us a summary for each rat (subject), including session numbers, session dates, and the number of trials for each session.
* This allows us to see if there are any missing/incorrect session numbers, and if MED-PC exported all of the desired data into the Excel file you loaded in (`file_names`).  
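
To give a feel for what this kind of summary contains, here is a minimal sketch using a pandas `groupby`. It is not the `check_sessions` implementation: the toy data and the `Subject`, `Session`, and `StartDate` column names are assumptions for illustration.

```python
import pandas as pd

# Toy trial-by-trial data, one row per trial (column names are assumptions).
trials = pd.DataFrame({
    "Subject": [1, 1, 1, 2, 2],
    "Session": [29, 29, 30, 29, 31],
    "StartDate": ["2021-03-01", "2021-03-01", "2021-03-02",
                  "2021-03-01", "2021-03-03"],
})

# One row per subject/session with its date and number of trials.
session_summary = (
    trials.groupby(["Subject", "Session"])
    .agg(date=("StartDate", "first"), n_trials=("StartDate", "size"))
    .reset_index()
)
print(session_summary)
```

In this toy table, subject 2 jumps from session 29 to session 31, which is the kind of numbering problem the summary helps you spot.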

In [None]:
rgt.check_sessions(df)

**To drop/edit data from certain session(s)**
* In `rgt.drop_sessions`, replace `session_num` with a list of the session numbers you want to remove
    * For example, to remove all data from sessions 28 and 29, I would write: `rgt.drop_sessions(df, [28, 29])`
    * **Make sure to remove the correct session(s)**. If you remove the wrong session and want to put the data back, you'll have to restart the kernel and start again from `load_data`
    * Requirement: each session number must exist in the session column of `df`

* In `rgt.edit_sessions`, write the original session numbers you want to change, followed by the numbers you want to replace them with, in the same order 
    * For example, to change **all** 30s to 29s, and 31s to 30s, I would write: `rgt.edit_sessions(df2, [30, 31], [29, 30], subs = "all")`
    * If you want to make edits **for certain subjects only**, assign the subject numbers to `subs`. For example, I would write `subs = [17, 21]`
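
The sketch below shows the same two ideas (dropping sessions and renumbering them for certain subjects) in plain pandas. It is only an illustration under assumed column names (`Subject`, `Session`), not the code inside `rgt_functions.py`.

```python
import pandas as pd

# Toy data: one row per trial (column names are assumptions).
toy = pd.DataFrame({
    "Subject": [17, 17, 21, 21, 22],
    "Session": [28, 30, 30, 31, 30],
})

# Drop: keep only the rows whose session is not in the list to remove.
dropped = toy[~toy["Session"].isin([28])].copy()

# Edit: pair old and new numbers in order (30 -> 29, 31 -> 30) ...
mapping = dict(zip([30, 31], [29, 30]))
edited = dropped.copy()
# ... and apply the renumbering to subjects 17 and 21 only.
mask = edited["Subject"].isin([17, 21])
edited.loc[mask, "Session"] = edited.loc[mask, "Session"].map(
    lambda s: mapping.get(s, s)
)
print(edited)
```

Because the old and new numbers are paired by position, the order of the two lists matters. Subject 22 keeps session 30 because it is not in the list of subjects to edit.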


In [None]:
# rgt.drop_sessions(df, session_num) 
#needs to be assigned to new object
df2 = rgt.drop_sessions(df, [28])

In [None]:
# rgt.edit_sessions(df2, [30, 31], [29, 30], subs = "all")

**Check that you dropped/edited the desired session(s)**

In [None]:
rgt.check_sessions(df2) #df2

**Run the following cell to acquire a summary of your data.**

The rows represent subjects (rats 1 to n)

The columns are explained below:
* `##P#` represents the percent choice of each option. For example, `29P1` represents the percentage of times P1 was selected during the 29th session. 
* `risk##` represents the risk score for each session: (P1 + P2) - (P3 + P4) 
* `collect_lat##` represents the mean collect latency for each session
* `choice_lat##` represents the mean choice latency for each session 
* `trial##` represents the number of trials (not including premature responses or omissions) for each session
* `prem##` represents the number of premature responses for each session
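
As a worked example of the first two columns, the sketch below computes the percent choice of each option and the risk score for one rat in one session. The list of choices is made up for illustration, and the code is not taken from `get_summary_data`.

```python
from collections import Counter

# Hypothetical choices made by one rat in one session (completed trials only).
choices = ["P1", "P2", "P2", "P2", "P3", "P2", "P4", "P2", "P1", "P2"]

counts = Counter(choices)
n_trials = len(choices)

# Percent choice of each option, as in the `##P#` columns.
percent = {opt: 100 * counts[opt] / n_trials for opt in ["P1", "P2", "P3", "P4"]}

# Risk score: (P1 + P2) - (P3 + P4).
risk = (percent["P1"] + percent["P2"]) - (percent["P3"] + percent["P4"])
print(percent, risk)
```

Here the rat chose P1 or P2 on 80% of trials and P3 or P4 on 20%, so the risk score is 60.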

In [None]:
df_sum = rgt.get_summary_data(df2)
df_sum #prints the dataset 

**Get the risk status of the rats using the following code**
* Note: 
    * `risk_status == 1` indicates a positive risk score (optimal) 
    * `risk_status == 2` indicates a negative risk score (risky)
    * `mean_risk` is the mean risk score averaged across the sessions between `startsess` and `endsess` for a given subject
        * You can change `startsess` and `endsess` by passing the session numbers instead. For example, `rgt.get_risk_status(df_sum, 28, 30)`
        * Requirement: `startsess` and `endsess` must be in the df
    * `print(risky, optimal)` prints two lists of rat subjects: the risky rats and the optimal rats 
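
The logic described above can be sketched in a few lines of pandas. This is an illustration, not the `get_risk_status` source: the toy summary table is made up, and treating a mean risk score of exactly 0 as risky is an assumption.

```python
import pandas as pd

# Toy summary table indexed by subject, with `risk##` columns as described above.
summary = pd.DataFrame(
    {"risk29": [40.0, -10.0, 5.0], "risk30": [60.0, -30.0, -15.0]},
    index=[1, 2, 3],
)

startsess, endsess = 29, 30
risk_cols = [f"risk{s}" for s in range(startsess, endsess + 1)]

# Mean risk score across the selected sessions for each subject.
summary["mean_risk"] = summary[risk_cols].mean(axis=1)
# 1 = positive mean risk score (optimal), 2 = otherwise (risky).
# Assumption: a mean of exactly 0 is counted as risky here.
summary["risk_status"] = summary["mean_risk"].apply(lambda r: 1 if r > 0 else 2)

optimal = summary.index[summary["risk_status"] == 1].tolist()
risky = summary.index[summary["risk_status"] == 2].tolist()
print(risky, optimal)
```

Subject 3 shows why the mean matters: its risk score is positive in session 29 but negative on average, so it is classed as risky.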

In [None]:
df_summary, risky, optimal = rgt.get_risk_status(df_sum, startsess, endsess)

print(df_summary[['mean_risk','risk_status']]) 
print(risky, optimal) 

**Export your data to an Excel file!** 
* Assign the name of the column that will specify the control vs. experimental group. Ex) `tg_status`
* Assign the name of the **new** Excel file to `new_file_name`

In [None]:
rgt.export_to_excel(df_summary, [control_group, exp_group], column_name = 'tg_status', new_file_name = 'BH07_free_S29-30.xlsx')

**Summarize your data by experimental/control set**
* If you only want to view certain columns, specify them in mean_scores 
    * For example, `mean_scores[['risk29', 'risk30']]` will create a table with only those 2 columns
* Each value is the mean for that column (ex. `29P1`) within the set (`tg negative` or `tg positive`) 
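
For reference, here is a minimal sketch of how group means and standard errors can be computed with pandas. It is not the `get_group_means_sem` implementation: the toy values are made up, and using the sample standard deviation (pandas' default for `.sem()`) is an assumption.

```python
import pandas as pd

# Toy summary table indexed by subject number (values are made up).
summary = pd.DataFrame(
    {"risk29": [10.0, 20.0, 30.0, 50.0, 70.0]},
    index=[3, 4, 5, 1, 2],
)
control_group = [3, 4, 5]
exp_group = [1, 2]
names = {0: "tg negative", 1: "tg positive"}

means, sems = {}, {}
for i, subjects in enumerate([control_group, exp_group]):
    rows = summary.loc[subjects]        # rows for this group's subjects
    means[names[i]] = rows.mean()       # mean of each column
    sems[names[i]] = rows.sem()         # sample SD / sqrt(n); needs n >= 2

# One row per group, one column per variable.
mean_table = pd.DataFrame(means).T
sem_table = pd.DataFrame(sems).T
print(mean_table)
print(sem_table)
```

The standard error is undefined for a group with a single subject, so each group needs at least two rats.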

In [None]:
mean_scores, stderror = rgt.get_group_means_sem(df_summary, [control_group, exp_group], group_names_2) #groups = [control_group, exp_group]
mean_scores #all mean scores
# mean_scores[['risk29', 'risk30']] #specify columns

# 2B) Acquisition Analysis (Plotting Section)

**Graph of the table above**
* `variable` specifies the variable you want to plot. 
    * For example, if I want to plot `choice_lat` over sessions for the experimental and control group, I would replace `variable` with `'choice_lat'`
    * Requirement: you must replace `variable` with a column name without the session numbers. For example, to plot mean risk scores, I would write `risk`, not `risk29`
* `startsess` and `endsess` can also be replaced with the range of session numbers you'd like to plot 
    * For example, if I want to plot `choice_lat` over sessions 29 to 31, I would replace `startsess` and `endsess` with `29` and `31` respectively
    * Requirement: `startsess` and `endsess` must be in the df
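
One way to picture these requirements: the summary columns are named with the variable followed by the session number (for example `risk29`), so a plotting function can build the column names from `variable`, `startsess`, and `endsess`. The sketch below shows that idea. Whether `rgt_plot` does exactly this is an assumption based on the column names, not something taken from `rgt_functions.py`.

```python
variable = "choice_lat"
startsess, endsess = 29, 31

# One column per session: the variable name followed by the session number.
columns = [f"{variable}{session}" for session in range(startsess, endsess + 1)]
print(columns)
```

If you passed `'choice_lat29'` instead of `'choice_lat'`, the names built this way would not match any column, and a session outside the df would have no column to plot.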

In [None]:
rgt.rgt_plot('risk', startsess, endsess, group_names_2, title, mean_scores, stderror, var_title = 'risk score') #group_names_2 has to be a dict, not a list 

**Plot the above data as a bar plot instead of a line plot** 

In [None]:
rgt.rgt_bar_plot('risk', startsess, endsess, group_names, title, mean_scores, stderror, var_title = None)

**Bar plot of P1-P4 choices**
* The following bar plot plots the mean P1-P4 choices for the tg negative and tg positive groups 

In [None]:
rgt.choice_bar_plot(startsess, endsess, mean_scores, stderror)

***
# 3A) Latin Square Analysis (Analysis by Group) 

**Please note! If you are trying to perform Acquisition Analysis, finish section 2A and 2B, and skip section 3A and 3B**

Set your objects! These will be used in the rest of sections 3A and 3B. Examples are left in for clarity.
* Assign the names of the files that you want to analyze to `file_names`
* Assign the names of the control and experimental group **in order** to `ls_group_names` as a dictionary 
* Assign the rat subject numbers to `lOFC` and `PrL`, then collect both lists in `ls_groups`

In [None]:
ls_group_names = {0:'lOFC',
               1:'PrL'} 

lOFC = [1,2,3,4,5,8,9,11,12,13,14,15,16]

PrL = [6,7] #TODO: change this back with the new data

ls_groups = [lOFC, PrL]

**Check your group data for errors**
* This will show you the number of trials performed by each subject-group pairing. 

In [None]:
rgt.check_groups(df)

**Drop/Edit your group data**
* For example, if I want to change `group = 0` to `group = 1` for all subjects, I would write `rgt.edit_groups(df, orig_group = [0], new_group = [1], subs = "all")`
    * If I want to do the same thing but only for subject 2 and 3, change `subs = "all"` to `subs = [2,3]`
* For example, if I want to remove the data for subjects 5, 9 and 12, I would write `rgt.drop_subjects(df, subs = [5, 9, 12])`

In [None]:
rgt.edit_groups(df, orig_group = [0], new_group = [3], subs = [5])

# rgt.drop_subjects(df, subs = [5, 9, 12])

**Check that you edited the desired group(s)**

In [None]:
rgt.check_groups(df)

**Run the following cell to acquire a summary of your data.**

The rows represent subjects (rats 1 to n)

The columns are explained below:
* `##P#` represents the percent choice of each option. For example, `1P1` represents the percent choice of P1 for group 1. 
* `risk##` represents the risk score for each group: (P1 + P2) - (P3 + P4) 
* `collect_lat##` represents the mean collect latency for each group
* `choice_lat##` represents the mean choice latency for each group
* `trial##` represents the number of trials (not including premature responses or omissions) for each group
* `prem##` represents the number of premature responses for each group

In [None]:
df1 = rgt.get_summary_data(df, mode = 'Group')
df1

**Impute missing data (if necessary)**
* For example, if you have missing data for subject 12, session 2, you can impute it (take the mean of the sessions before and after). 
    * Code: `rgt.impute_missing_data(df1, session = 2, subject = 12, choice = 'all', vars = 'all')`
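
The imputation rule itself is simple arithmetic. Here is a minimal sketch for a single variable and a single subject. The toy values and the `risk1`/`risk2`/`risk3` column names are assumptions for illustration, and this is not the `impute_missing_data` source.

```python
import pandas as pd

# Toy summary row for one subject; the value for session 2 is missing.
row = pd.Series({"risk1": 30.0, "risk2": float("nan"), "risk3": 50.0})

session = 2
before, after = f"risk{session - 1}", f"risk{session + 1}"

# Replace the missing value with the mean of the neighbouring sessions.
row[f"risk{session}"] = (row[before] + row[after]) / 2
print(row["risk2"])
```

Note that this rule needs data on both sides of the gap, so it cannot fill in a missing first or last session.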

In [None]:
df_group_summary = rgt.impute_missing_data(df1, session = 2, subject = 12, choice = 'all', vars = 'all')
df_group_summary

**Get risk status of the vehicle**

**Note:** This section skips many of the steps included in the BH06 Latin square analysis. Still to add: risky vs. optimal graphs, including graphs by lOFC/PrL.

In [None]:
df1,risky,optimal = rgt.get_risk_status_vehicle(df1) 
df1

In [None]:
#make lists for risky and optimal rats in each group
#magical function that does that (value for value)

#calculate means for risky and optimal by group


#group_means_risk, sem_risk = rgt.get_group_means_sem(df1, ls_groups_risk, ls_group_names_risk) 

**Export your data to an Excel file!** 
* Assign the name of the column that will specify the control vs. experimental group. Ex) `brain_region`
* Assign the name of the **new** Excel file to `new_file_name` 

In [None]:
rgt.export_to_excel(df1, ls_groups, 'brain_region', new_file_name = 'BH06_all-data2.xlsx')

**Summarize your data by experimental/control set**
* If you only want to view certain columns, specify them in `group_means` 
    * For example, `group_means[['omit3', 'omit4']]` will create a table with only those 2 columns
    * Each value is the mean for that column (ex. `omit3`) within the set (`lOFC` or `PrL`)
    * Requirements: `PrL` and `lOFC` must contain subject numbers that are present in the df, and must each contain at least 2 elements (numbers) 

In [None]:
group_means, sem = rgt.get_group_means_sem(df1, ls_groups, ls_group_names) 
group_means
# group_means[['omit3', 'omit4']]

# 3B) Plotting Section (by Groups) 

**Graphs of the table above**
* Data is separated by brain region: `lOFC` and `PrL` 

In [None]:
rgt.ls_bar_plot('lOFC',group_means,sem)

In [None]:
rgt.ls_bar_plot('PrL',group_means,sem)

**Graph certain variables from the table above**

* `variable` specifies the variable you want to plot. 
    * For example, if I want to plot `choice_lat` across groups for the `lOFC` and `PrL` sets, I would replace `variable` with `'choice_lat'`
    * Requirement: you must replace `variable` with a column name without the group numbers. For example, to plot mean risk scores, I would write `risk`, not `risk1`
* In this section the summary columns are numbered by group rather than by session, so `startsess` and `endsess` are replaced with the first and last group numbers you'd like to plot 
    * For example, the cell below plots `omit` for groups 1 to 4 by passing `1` and `4`
    * Requirement: the group numbers you pass must be in the df

In [None]:
rgt.rgt_plot('omit',1,4,ls_group_names,'5-HT2c Antagonist',group_means,sem,var_title = 'Omissions')

In [None]:
#plots for risky vs optimal 
#use group names to specify choice plot for risky or optimal rats 

In [None]:
#line plot with risky and optimal rats from both groups

***
# 4) Miscellaneous Section (more advanced code) 

**Change your working directory**

Instructions: 
* Check your current working directory by running `os.getcwd()` 
* In your working directory, make a data folder (call it `data`), and add your .xlsx file to that folder. 
* Change `'C:\\Users\\dexte\\hathaway_1\\data'` to your current working directory followed by `'\\data'`
* For example, my current working directory is `'C:\\Users\\dexte\\hathaway_1'`, so I enter `'C:\\Users\\dexte\\hathaway_1\\data'` into the brackets (the slashes will be different if you are not using Windows). 
* After this change, data files are loaded from and saved to your data folder instead of your original working directory. 
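
If you would rather not type the slashes by hand, the path can be built from the current working directory. The sketch below is one possible approach using only the standard library. The folder name `data` follows the instructions above, and creating the folder automatically is an extra convenience, not part of the original workflow.

```python
import os

# Build the path to a `data` folder inside the current working directory.
# os.path.join uses the right separator for your operating system.
data_dir = os.path.join(os.getcwd(), "data")

# Create the folder if it does not exist yet, then show where it is.
os.makedirs(data_dir, exist_ok=True)
print(data_dir)
```

You could then pass `data_dir` to `os.chdir` instead of a hard-coded path.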

In [None]:
#prints the current working directory (only the last line of a cell is displayed automatically)
print(os.getcwd())

#changes the working directory to the path in the brackets
os.chdir('C:\\Users\\dexte\\hathaway_1\\data') 