# MEDPC Data Tutorial

**About:** This tutorial shows how to tidy, summarize, and plot trial-by-trial rat gambling task data exported from MED-PC.

**Contact:**
* Dexter Kim: dexterkim2000@gmail.com
* Brett Hathaway: bretthathaway@psych.ubc.ca

**Requirements**
* The data must be an Excel file from MEDPC2XL (trial-by-trial data)
* The data file, the `rgt_functions.py` file, and this notebook must all be in your current working directory
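
Before running anything else, it can help to confirm that the required files are where the notebook expects them. The following is a minimal sketch, not part of `rgt_functions`: `missing_files` is a hypothetical helper, and the demonstration runs in a temporary folder with a placeholder file so that nothing real is touched.

```python
from pathlib import Path
import tempfile

def missing_files(required, directory="."):
    """Return the names in `required` that are not files inside `directory`."""
    folder = Path(directory)
    return [name for name in required if not (folder / name).is_file()]

# Demonstration in a throwaway folder (hypothetical setup, no real data involved).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "rgt_functions.py").write_text("# placeholder\n")
    demo_missing = missing_files(
        ["rgt_functions.py", "BH07_raw_free_S29-30.xlsx"], tmp
    )

print(demo_missing)  # the Excel file was never created in the demo folder
```

In the notebook itself you would call `missing_files([...])` with your own file names and the default directory; an empty list means everything was found.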

**Note: This tutorial is split into multiple sections**
* Section 1: Setting Variables (objects) 
* Section 2: Loading Data into Python 
* Section 3: Acquisition analysis and plotting
* Section 4: Latin Square analysis and plotting
* Section 5: Miscellaneous

**Please run the following cell!**

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
#imports the MEDPC rat gambling task functions; prints "I am being executed!" if the import works
import rgt_functions as rgt

#main imports 
import os
import pandas as pd
import numpy as np

# plotting imports 
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# stats imports 
import scipy.stats as stats

#the following line silences pandas' unnecessary chained-assignment warnings
pd.options.mode.chained_assignment = None

# 1) Setting Variables (objects) 

Set your variables! These will be used later in the code. Example arguments are left in for clarity.

###Brett will edit

In [None]:
#in file_names (List[str]), add the file names you wish to read into Python 
# file_names = ['BH07_raw_free_S29-30.xlsx']

#in group_names (List[str]), add the names of the control and experimental group, respectively 
group_names = ['Tg negative','Tg positive'] 

title = 'Nigrostriatal activation during acquisition' #title for figures, describing the experiment
startsess = 29 #first session in this dataset
endsess = 30 #last session in this dataset

group_names_2 = {0: 'tg negative',
              1: 'tg positive'} ###group_names is used twice (as a list, and as a dict)

#the following two lines of code assign the rat subject numbers to the experimental and control group lists
exp_group = [1, 2, 7, 8, 11, 12, 16, 19, 20, 21, 22, 25, 26, 29, 32] #Tg positive

control_group = [3, 4, 5, 6, 9, 13, 14, 15, 17, 18, 23, 24, 27, 28, 30, 31] #Tg negative

###for Latin Square analysis
file_names = ['BH06_raw_round1-infusions.xlsx', 'BH06_raw_round1-makeup.xlsx']

ls_group_names = {0:'lOFC',
               1:'PrL'} 

lOFC = [1,2,3,4,5,8,9,11,12,13,14,15,16]

PrL = [6,7] ##have to change this back with the new data

ls_groups = [lOFC, PrL]
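
A quick sanity check on the group lists can catch typos before they cause errors later. The sketch below reuses `exp_group` and `control_group` from the cell above; `shared_subjects` and `unassigned_subjects` are hypothetical helpers, and the assumption that subjects are numbered 1 to 32 is only for illustration.

```python
exp_group = [1, 2, 7, 8, 11, 12, 16, 19, 20, 21, 22, 25, 26, 29, 32]  # Tg positive
control_group = [3, 4, 5, 6, 9, 13, 14, 15, 17, 18, 23, 24, 27, 28, 30, 31]  # Tg negative

def shared_subjects(group_a, group_b):
    """Return subjects that appear in both groups (should be empty)."""
    return sorted(set(group_a) & set(group_b))

def unassigned_subjects(all_subjects, *groups):
    """Return subjects that are not listed in any group."""
    assigned = set().union(*groups)
    return sorted(set(all_subjects) - assigned)

# Assumption: the cohort is numbered 1 to 32.
all_subjects = range(1, 33)

overlap = shared_subjects(exp_group, control_group)
unassigned = unassigned_subjects(all_subjects, exp_group, control_group)

print(overlap)     # []
print(unassigned)  # [10]
```

An unassigned subject is not necessarily a mistake (the animal may have been excluded), but it is worth confirming that the omission is intentional.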

###suggestion, make a data cell for section 3 and 4
###solution: move the variables to section 3 and 4 

# 2) Load data into Python
* `load_data()` takes one argument: `file_names`
* `load_data()` returns a table similar to the Excel sheet(s) you loaded, in the order established in `file_names`
* Note: `df` is short for dataframe; it is the object that stores your data as a table
* Passing `reset_sessions = True` makes the session numbers start from 1 again (you may want to do this for baseline analysis)
* `load_multiple_data()` loads multiple cohorts (with the same subject numbers) and assigns them unique subject numbers (e.g., subject 1 of cohort 1 --> subject 101)
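
The two renumbering ideas in the list above can be illustrated with plain Python. This is a sketch of the arithmetic only and may differ from what `rgt_functions` does internally; `unique_subject_number` and `renumber_sessions` are hypothetical names, and the first one assumes fewer than 100 subjects per cohort.

```python
def unique_subject_number(cohort, subject):
    """Combine cohort and subject numbers, e.g. cohort 1, subject 1 -> 101."""
    # Assumption: fewer than 100 subjects per cohort, so numbers cannot collide.
    return cohort * 100 + subject

def renumber_sessions(sessions):
    """Map the session numbers onto 1, 2, 3, ... while preserving their order."""
    ordered = sorted(set(sessions))
    mapping = {old: new for new, old in enumerate(ordered, start=1)}
    return [mapping[s] for s in sessions]

print(unique_subject_number(1, 1))      # 101
print(unique_subject_number(2, 14))     # 214
print(renumber_sessions([29, 29, 30]))  # [1, 1, 2]
```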

In [None]:
df = rgt.load_data(file_names)

#load_data does not display the dataframe itself. Use df.head() to view the first rows of your dataframe. Note: they should look exactly the same as the top of your first Excel file.

df.head()

# 3A) Acquisition Analysis Section (Analysis by Session)

**Check your session data**
* `check_sessions` gives us a summary for each rat (subject) including session numbers, session dates and # trials for each session.
* This allows us to see if there are any missing/incorrect session numbers, and if MED-PC exported all of the desired data into the Excel file you loaded in (`file_names`). 
* `edit_sessions()` can change session numbers (not included in the tutorial) 

In [None]:
rgt.check_sessions(df)

**To drop/remove data from certain session(s)**
* Replace `session_num` with a list of the session numbers whose data you want to remove
* For example, to remove all data from sessions 28 and 29, I would write: `rgt.drop_sessions(df, [28, 29])`
* **Make sure to remove the correct session(s).** If you remove the wrong session and want the data back, you'll have to restart the kernel and start again from `load_data`
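
The idea behind dropping sessions is a simple filter that returns a new table. The sketch below uses a list of dictionaries instead of a dataframe so that it runs on its own; the `Subject` and `Session` keys and the `drop_sessions_sketch` name are assumptions for illustration, not the real `rgt.drop_sessions`.

```python
# Hypothetical trial rows; a real dataframe has many more columns.
rows = [
    {"Subject": 1, "Session": 28},
    {"Subject": 1, "Session": 29},
    {"Subject": 2, "Session": 28},
    {"Subject": 2, "Session": 30},
]

def drop_sessions_sketch(rows, session_nums):
    """Return a new list without the rows from the unwanted sessions."""
    unwanted = set(session_nums)
    return [row for row in rows if row["Session"] not in unwanted]

kept = drop_sessions_sketch(rows, [28])

print(len(rows), len(kept))           # 4 2
print([r["Session"] for r in kept])   # [29, 30]
```

Because the filter returns a new object, the original `rows` list is left untouched in this sketch, which is why the result has to be stored under a new name.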

###do you want drop_sessions to be part of the tutorial? Or edit_sessions()? ###solution: include 

In [None]:
# rgt.drop_sessions(df, session_num) #session_num is a list of integers
#needs to be assigned to new object
df2 = rgt.drop_sessions(df, [28])

In [None]:
### this is a cell showing that it works --> 30 becomes 29, and 31 becomes 30, for subjects 17 to 32

# rgt.edit_sessions(df2, [30, 31], [29, 30], subs = list(range(17,33)))

**Check that you dropped the desired session (in this example, we dropped data from session 28)**

In [None]:
rgt.check_sessions(df2) #df2 is the dataframe returned by drop_sessions

**Run the following cell to acquire a summary of your data.**

The rows represent subjects (rats 1 to n)

The columns are explained below:
* `##P#` represents the percent choice of each option. For example, `29P1` represents the percentage of times P1 was selected during the 29th session. 
* `risk##` represents the risk score for each session: (P1 + P2) - (P3 + P4) 
* `collect_lat##` represents the mean collect latency for each session
* `choice_lat##` represents the mean choice latency for each session 
* `trial##` represents the number of trials (not including premature responses or omissions) for each session
* `prem##` represents the number of premature responses for each session
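
As a worked example of the risk score formula and the column naming pattern described above, here is a minimal sketch. The helper names are hypothetical and the percentages are made-up values.

```python
def risk_score(p1, p2, p3, p4):
    """Risk score: (P1 + P2) - (P3 + P4), using percent choice of each option."""
    return (p1 + p2) - (p3 + p4)

def choice_column(session, option):
    """Column name for the percent choice of one option, e.g. '29P1'."""
    return f"{session}P{option}"

def risk_column(session):
    """Column name for the risk score of one session, e.g. 'risk29'."""
    return f"risk{session}"

# Made-up percentages for one rat in session 29 (they sum to 100).
print(choice_column(29, 1))        # 29P1
print(risk_column(29))             # risk29
print(risk_score(10, 60, 20, 10))  # 40
```

A rat that chooses P1 and P2 on 70% of trials and P3 and P4 on 30% therefore has a risk score of 40.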

In [None]:
df_sum = rgt.get_summary_data(df)
df_sum #prints the dataset 

**Get the risk status of the rats using the following code**
* Note: 
    * `risk_status == 1` indicates a positive risk score (optimal) 
    * `risk_status == 2` indicates a negative risk score (risky)
    * `mean_risk` is the mean risk score averaged across the sessions between `startsess` and `endsess` for a given subject
        * you can change `startsess` and `endsess` by passing the session numbers instead. For example, `rgt.get_risk_status(df_sum, 28, 30)`
    * `print(risky, optimal)` prints two lists of subject numbers: the risky rats and the optimal rats
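
The classification rule in the notes above can be sketched as follows. This is an illustration, not the code inside `rgt.get_risk_status`; in particular, the notes do not say how a mean of exactly zero is handled, so treating it as risky here is an assumption.

```python
from statistics import mean

def risk_status_sketch(risk_scores):
    """Return (mean_risk, risk_status) for one rat's per-session risk scores."""
    mean_risk = mean(risk_scores)
    # 1 = optimal (positive mean), 2 = risky (negative mean).
    # Assumption: a mean of exactly zero is counted as risky.
    status = 1 if mean_risk > 0 else 2
    return mean_risk, status

# Made-up risk scores for sessions 29 and 30.
optimal_rat = risk_status_sketch([40, 20])    # mean 30, status 1
risky_rat = risk_status_sketch([-50, -10])    # mean -30, status 2

print(optimal_rat[1], risky_rat[1])
```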

In [None]:
df_summary, risky, optimal = rgt.get_risk_status(df_sum, startsess, endsess)

print(df_summary[['mean_risk','risk_status']]) #prints 2 of the many columns in df_summary
print(risky, optimal) #prints 2 lists: the subject numbers of the risky rats, and the subject numbers of the optimal rats

**Export your data to an Excel file!** 
* Note: `'tg_status'` is the column name that specifies the control vs. experimental group
* Note2: `'BH07_free_S29-30.xlsx'` specifies the name of the **new** Excel file 

###may want to change file_name to new_file_name for clarity ###change it 

In [None]:
rgt.export_to_excel(df_summary, [control_group, exp_group], column_name = 'tg_status', file_name = 'BH07_free_S29-30.xlsx')

##Brett will edit control_group

**Summarize your data by experimental/control set**
* if you only want to view certain columns, specify them in mean_scores 
    * For example, `mean_scores[['risk29', 'risk30']]` will create a table with only those 2 columns
    * Each value is the mean for that column (ex. `29P1`) within the set (`tg negative` or `tg positive`) ###is this correct?
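
To make the group summary concrete, the sketch below computes a mean and a standard error of the mean (SEM) for one column within each group. It assumes the SEM is the sample standard deviation divided by the square root of the group size, which may not match `rgt.get_group_means_sem` exactly; the scores and the two-rat groups are made up.

```python
from math import sqrt
from statistics import mean, stdev

def group_mean_sem(values):
    """Return (mean, standard error of the mean) for one group's values."""
    if len(values) < 2:
        # The sample standard deviation is undefined for a single value.
        raise ValueError("a group needs at least 2 subjects to compute the SEM")
    return mean(values), stdev(values) / sqrt(len(values))

# Hypothetical `risk29` scores keyed by subject number (illustrative values only).
risk29 = {1: 10.0, 2: 30.0, 3: 50.0, 4: 70.0}

# Small subsets of the tutorial's groups, for illustration.
groups = {"tg negative": [3, 4], "tg positive": [1, 2]}

summary = {}
for name, subjects in groups.items():
    summary[name] = group_mean_sem([risk29[s] for s in subjects])
    print(name, summary[name][0], round(summary[name][1], 2))
```

Each entry in `summary` corresponds to one cell of the group table: the mean of a single column within a single group, plus its SEM.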
    
###in addition, it's strange that we have control_group and exp_group as the objects for 'tg_negative' and 'tg_positive', but I think you already noticed this. 

In [None]:
mean_scores, stderror = rgt.get_group_means_sem(df_summary, [control_group, exp_group], group_names_2) ###[control_group, exp_group] = groups 
mean_scores #all mean scores
# mean_scores[['risk29', 'risk30']] #specify columns

# 3B) Acquisition Analysis (Plotting Section)

**Graph of the table above**
* `variable` specifies the variable you want to plot. 
    * For example, if I want to plot `choice_lat` over sessions for the experimental and control group, I would replace `variable` with `'choice_lat'`
* `startsess` and `endsess` can also be replaced with the range of session numbers you'd like to plot 
    * For example, if I want to plot `choice_lat` over sessions 29 to 31, I would replace `startsess` and `endsess` with `29` and `31` respectively
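
One way to think about the `variable` argument: the summary table names its columns as a variable followed by a session number (`risk29`, `choice_lat30`, and so on), so a variable name plus a session range presumably identifies the columns to plot. The sketch below only builds those column names; it is an assumption about the naming pattern, not the internals of `rgt.rgt_plot`, and `session_columns` is a hypothetical helper.

```python
def session_columns(variable, startsess, endsess):
    """List the summary column names for `variable` from startsess to endsess."""
    return [f"{variable}{session}" for session in range(startsess, endsess + 1)]

print(session_columns("risk", 29, 30))
# ['risk29', 'risk30']
print(session_columns("choice_lat", 29, 31))
# ['choice_lat29', 'choice_lat30', 'choice_lat31']
```

Under this reading, any prefix listed in the summary column descriptions (`risk`, `collect_lat`, `choice_lat`, `trial`, `prem`) would be a candidate value for `variable`.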

###this could be improved in description - 
###why does typing 'risk' work? ###explain what they can type into variables 

In [None]:
rgt.rgt_plot('risk', startsess, endsess, group_names_2, title, mean_scores, stderror, var_title = 'risk score') ##group_names_2 has to be a dict not a list 

**Plot the same data as a bar plot instead of a line plot**
* Must use the same arguments ##hard for me to tell why this would occur

In [None]:
rgt.rgt_bar_plot('risk', startsess, endsess, group_names, title, mean_scores, stderror, var_title = None)

**Bar plot of P1-P4 choices**
* The following bar plot plots the mean P1-P4 choices for the tg negative and tg positive groups 

In [None]:
rgt.choice_bar_plot(startsess, endsess, mean_scores, stderror)

# 4A) Latin Square Analysis (Analysis by Group) 

* Please note! This is the same workflow as the acquisition analysis 

**Check your group data**
* This will show you the number of trials performed by each subject-group pairing. 

In [None]:
rgt.check_groups(df)

**Edit your group data**
* For example, if I want to change group 0 to group 1 for all subjects, I would write `rgt.edit_groups(df, orig_group = [0], new_group = [1], subs = "all")`
    * To do the same thing for subjects 2 and 3 only, change `subs = "all"` to `subs = [2,3]`
* To remove the data for subjects 5, 9, and 12, I would write `rgt.drop_subjects(df, subs = [5, 9, 12])`

In [None]:
rgt.edit_groups(df, orig_group = [0], new_group = [3], subs = [5])

# rgt.drop_subjects(df, subs = [5, 9, 12])

**Check that you edited the desired group**

In [None]:
rgt.check_groups(df)

**Run the following cell to acquire a summary of your data.**

The rows represent subjects (rats 1 to n)

The columns are explained below:
* `##P#` represents the percent choice of each option. For example, `1P1` represents the percent choice of P1 for group 1
* `risk##` represents the risk score for each group: (P1 + P2) - (P3 + P4) 
* `collect_lat##` represents the mean collect latency for each group
* `choice_lat##` represents the mean choice latency for each group
* `trial##` represents the number of trials (not including premature responses or omissions) for each group
* `prem##` represents the number of premature responses for each group

In [None]:
df1 = rgt.get_summary_data(df, mode = 'Group')
df1

**Impute your missing data**
* For example, if subject 12 is missing data for session 2, you can impute it (take the mean of the session before and the session after).
    * Code: `rgt.impute_missing_data(df1, session = 2, subject = 12, choice = 'all', vars = 'all')`
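
The imputation rule described above (the mean of the session before and the session after) is simple arithmetic. Here is a minimal sketch for a single variable of a single subject; `impute_from_neighbours` is a hypothetical helper and the values are made up.

```python
def impute_from_neighbours(values_by_session, session):
    """Return the mean of the values from the sessions before and after."""
    before = values_by_session[session - 1]
    after = values_by_session[session + 1]
    return (before + after) / 2

# Made-up risk scores for subject 12; session 2 is missing.
risk_by_session = {1: 40.0, 3: 60.0}
risk_by_session[2] = impute_from_neighbours(risk_by_session, 2)

print(risk_by_session[2])  # 50.0
```

This only works when both neighbouring sessions exist, so the first and last sessions cannot be imputed this way in the sketch.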

In [None]:
df_group_summary = rgt.impute_missing_data(df1, session = 2, subject = 12, choice = 'all', vars = 'all')
df_group_summary

**Summarize your data by experimental/control set**
* If you only want to view certain columns, specify them in group_means 
    * For example, `group_means[['omit3', 'omit4']]` will create a table with only those 2 columns ###change
    * Each value is the mean for that column (ex. `omit3`) within the set (`lOFC` or `PrL`) ###is this correct?
    
###this doesn't work - something about stuff being not supported... didn't have time to rewrite it. I really tried...
###solutionY: PrL and lOFC cannot contain subject labels that aren't present in the data --> KeyError
###solutionX: PrL and lOFC must contain at least 2 subjects. 
###solution: check that you have X and Y 
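
The two conditions noted above (every listed subject must be present in the data, and every group needs at least 2 subjects) can be checked before calling `get_group_means_sem`. The sketch below is a hypothetical helper, not part of `rgt_functions`; the group lists are the ones from Section 1, while the set of subjects present in the data is made up.

```python
def group_problems(groups, subjects_in_data, min_size=2):
    """Return a list of human-readable problems; an empty list means all is well."""
    present = set(subjects_in_data)
    problems = []
    for name, members in groups.items():
        absent = sorted(set(members) - present)
        if absent:
            problems.append(f"{name}: subjects not in data: {absent}")
        if len(members) < min_size:
            problems.append(f"{name}: needs at least {min_size} subjects")
    return problems

groups_by_name = {
    "lOFC": [1, 2, 3, 4, 5, 8, 9, 11, 12, 13, 14, 15, 16],
    "PrL": [6, 7],
}

# Hypothetical case: subject 7 has no rows in the loaded data.
subjects_in_data = [s for s in range(1, 17) if s != 7]

problems = group_problems(groups_by_name, subjects_in_data)
print(problems)  # ['PrL: subjects not in data: [7]']
```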

In [None]:
group_means, sem = rgt.get_group_means_sem(df1, ls_groups, ls_group_names) 
group_means
# group_means[['omit3', 'omit4']]

**Get risk status of the vehicle**

###This skipped a lot of steps included in the BH06 LS analysis ###have risky vs optimal graphs, and also have by lOFC/PrL 
###This is obviously not exported... but I can include it before export

In [None]:
df1,risky,optimal = rgt.get_risk_status_vehicle(df1) 
df1

In [None]:
#make lists for risky and optimal rats in each group
#magical function that does that (value for value)

#calculate means for risky and optimal by group


#group_means_risk, sem_risk = rgt.get_group_means_sem(df1, ls_groups_risk, ls_group_names_risk) 
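
One possible way to build the risky and optimal lists for each group is to intersect each group list with the `risky` and `optimal` lists. The sketch below is a suggestion rather than an existing `rgt_functions` feature; `split_by_risk` is a hypothetical helper, and the risky/optimal assignments are made up.

```python
def split_by_risk(groups, risky, optimal):
    """Split each group into its risky and optimal members."""
    risky_set, optimal_set = set(risky), set(optimal)
    split = {}
    for name, members in groups.items():
        split[f"{name} risky"] = [s for s in members if s in risky_set]
        split[f"{name} optimal"] = [s for s in members if s in optimal_set]
    return split

groups_by_name = {
    "lOFC": [1, 2, 3, 4, 5, 8, 9, 11, 12, 13, 14, 15, 16],
    "PrL": [6, 7],
}

# Made-up risk classifications, for illustration only.
risky = [2, 6, 9]
optimal = [1, 3, 4, 5, 7, 8, 11, 12, 13, 14, 15, 16]

split = split_by_risk(groups_by_name, risky, optimal)

# Same shapes as `ls_groups` (list of lists) and `ls_group_names` (dict).
ls_groups_risk = list(split.values())
ls_group_names_risk = dict(enumerate(split.keys()))

print(split["lOFC risky"], split["PrL risky"])  # [2, 9] [6]
print(ls_group_names_risk)
```

Note that a sub-group with fewer than 2 rats (such as the single risky PrL rat here) would not have a defined standard error.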

**Export your data to an Excel file!** 
* Note: `'brain_region'` is the column name that specifies the group (`lOFC` or `PrL`)
* Note2: `'BH06_all-data2.xlsx'` specifies the name of the **new** Excel file 

###may want to change file_name to new_file_name for clarity 
###does work!

In [None]:
rgt.export_to_excel(df1,ls_groups,'brain_region','BH06_all-data2.xlsx')

# 4B) Plotting Section (by Groups) 

**Graph of the table above**
* `variable` specifies the variable you want to plot. 
* For example, if I want to plot `choice_lat` over sessions for the experimental and control group, I would replace `variable` with `'choice_lat'`

##this could be improved in description

In [None]:
rgt.ls_bar_plot('lOFC',group_means,sem)

In [None]:
rgt.ls_bar_plot('PrL',group_means,sem)

In [None]:
rgt.rgt_plot('omit',1,4,ls_group_names,'5-HT2c Antagonist',group_means,sem,var_title = 'Omissions')

In [None]:
#plots for risky vs optimal 
#use group names to specify choice plot for risky or optimal rats 

In [None]:
#line plot with risky and optimal rats from both groups

# 5) Miscellaneous Section (more advanced code) 

**Change your working directory**

Instructions: 
* Check your current working directory by running `os.getcwd()` (line 2 of the cell below). 
* Inside your working directory, make a folder called `data` and move your .xlsx file into that folder. 
* Change `('C:\\Users\\dexte\\hathaway_1\\data')` to your current working directory followed by `'\\data'`
* For example, my current working directory is `'C:\\Users\\dexte\\hathaway_1'`, so I enter `'C:\\Users\\dexte\\hathaway_1\\data'` into the brackets (the slashes will be different if you are not using Windows). 
* This saves all data in your data folder, instead of your current working directory. 
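
If you would rather not type the slashes by hand, `pathlib` builds the path with the correct separator for your operating system. This is a minimal sketch; `data_folder` is a hypothetical helper, and the demonstration uses a temporary folder so that it does not create anything in your real working directory.

```python
from pathlib import Path
import tempfile

def data_folder(base):
    """Return the path to a `data` folder inside `base`, creating it if needed."""
    folder = Path(base) / "data"
    folder.mkdir(exist_ok=True)
    return folder

# Demonstration in a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = data_folder(tmp)
    created = folder.is_dir()
    folder_name = folder.name

print(created, folder_name)  # True data
```

In the notebook, `os.chdir(data_folder(os.getcwd()))` would then switch into the `data` folder of your current working directory.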

##default: just have their data in their cwd (easier option)
##future: write a function that will save files in separate folder (for them)

In [None]:
#prints the current working directory
print(os.getcwd())

#changes the working directory to the path included in the brackets
os.chdir('C:\\Users\\dexte\\hathaway_1\\data') 