<a href="#Overview"></a>
# Overview
* <a href="#Introduction">Introduction</a>
  * <a href="#Methods,-function,-data-types,-and-techiniques">Methods, function, data types, and techiniques</a>
    * <a href="#Introduced-in-today's-class">Introduced in today's class</a>
    * <a href="#Reviewed-from-previous-class">Reviewed from previous class</a>
* <a href="#Loading-the-data">Loading the data</a>
  * <a href="#Exercise-1:-read-a-CSV-file">Exercise 1: read a CSV file</a>
* <a href="#Selecting-and-averaging-subsets-of-data">Selecting and averaging subsets of data</a>
  * <a href="#Exercise-2:-using-masks">Exercise 2: using masks</a>
  * <a href="#Exercise-3:-using-groupby">Exercise 3: using groupby</a>
  * <a href="#Exercise-4:-more-masks">Exercise 4: more masks</a>
  * <a href="#Exercise-5:-more-groupby">Exercise 5: more groupby</a>
* <a href="#Plotting-the-data">Plotting the data</a>
  * <a href="#Exercise-6:-plotting-mean-latency-by-condition">Exercise 6: plotting mean latency by condition</a>
  * <a href="#Exercise-7:-plotting-mean-latency-by-modality">Exercise 7: plotting mean latency by modality</a>
  * <a href="#Exercise-8:-plotting-mean-latency-by-group,-modality-and-condition">Exercise 8: plotting mean latency by group, modality and condition</a>
  * <a href="#Exercise-9:-plotting-percent-correct-by-group,-modality-and-condition">Exercise 9: plotting percent correct by group, modality and condition</a>

<a name="#Introduction"></a>
# Introduction
<a href="#Overview">Return to overview</a>

**Analysis of a Behavioral Working Memory Task: the N-Back**

Our project involves data collected from two different participant groups: those who have auditory processing problems following head injuries and non-head-injured control participants.

The data we’re working with this week involves a common cognitive working memory task called the N-Back.  In our version of the task, participants listen to strings of syllables and must decide whether the current stimulus matches the one displayed n trials ago, where n is a variable number that can be adjusted up or down to respectively increase or decrease cognitive load.  Our experimental conditions include a 0-Back, 1-Back, and 2-Back.  For the 0-Back, the participant only has to remember the first syllable they heard, and they press a button as soon as they hear it again:
![N-0 Back](ZeroBack.jpg)

For the 1-Back condition, the participant presses the button when the syllable that they hear matches the syllable that was just presented in the previous trial (e.g. pairs of the same syllable):
![N-1 Back](OneBack.jpg)

Finally, for the 2-Back condition, participants press the button when the syllable they hear matches the syllable they heard two syllables ago (e.g. syllable “sandwiches”):
![N2-Back](TwoBack.jpg)

So, as you can imagine, the task becomes harder as the value of N is increased and participants have to hold more syllables in working memory.  We are presenting stimuli in both visual and auditory modalities, and thus we have a total of 6 different stimulus conditions (0-, 1-, and 2-Back conditions in visual and auditory modalities), and are measuring how well participants are able to perform the task in each condition (measured as percent of correct button presses in response to target stimuli) as well as the latency of their responses.  Because this is a cognitively demanding task, each condition is measured during three separate runs in order to give folks frequent breaks.  

The basic questions we are trying to answer are as follows:
- Do participants show the expected reduction in accuracy and increase in latency with increasing n?
- Is there a difference in performance accuracy or speed according to sensory modality?
- Is there a difference in accuracy and speed between participant groups? For example, do participants with auditory processing problems demonstrate more trouble with the auditory modality than the visual modality?


**We can use Python to help us analyze the data to answer these questions!!**

<a name="#Methods,-function,-data-types,-and-techiniques"></a>
## Methods, function, data types, and techiniques
<a href="#Overview">Return to overview</a>

<a name="#Introduced-in-today's-class"></a>
### Introduced in today's class
<a href="#Overview">Return to overview</a>

* `basename` - given a file path (e.g., `Documents/data/filename.csv`), return only the filename (e.g., `filename.csv`);
* `append()` - add an item to the end of a list;
* `glob()` - grabs the file names that match a pattern and makes them into a list;
* `concat()` - concatenate a list of dataframes into a single dataframe;
* `split()` - split a string;
* `head()` - a convenient way to show only the top rows of a dataframe;
* `dtypes` - an attribute that reports the data type of each column;
* `astype()` - convert to a new data type (e.g., integer, string, etc.);
* `replace()` - replace certain values in your dataframe;
* `yerr` - to add error bars to plots on the y axis;
* `legend()` - work with the legend for your plots;
* `sem()` - to calculate the standard error of the mean
* `plt.ylabel()` - a function in the matplotlib library that allows you to specify the y axis label
* `set_ylabel()` - an alternate method for specifying the y-axis label
* masks - a technique for pulling out only specific data from your dataframe and making it into its own dataframe;
* boolean values - True/False values and how to use them;
* for loops - making iterations easy!

<a name="#Reviewed-from-previous-class"></a>
### Reviewed from previous class
<a href="#Overview">Return to overview</a>

* `read_csv()`;
* `loc`;
* `groupby()`;
* `mean()`;
* `plot()`;
* `unstack()`;

<a name="#Loading-the-data"></a>
# Loading the data
<a href="#Overview">Return to overview</a>

Let's import some libraries to help us work with the data.  Run the following cell to import the libraries specified:

In [None]:
from glob import glob
import os.path

import pandas as pd

Our data are stored as comma-separated values (CSV) files.  First, it's important to know what the data look like in the CSV files so that we know what we're working with. Here's a screen shot of the first file in our list:
![First CSV file in our list](csv_file_pic.jpg)

The first line is our header, which includes the column names (i.e., trial number, response, type, correct, latency, and stim/response). Awesome! Since it's formatted so nicely, we can just use `pd.read_csv()` like we learned last week to read the files into a dataframe.

<a name="#Exercise-1:-read-a-CSV-file"></a>
## Exercise 1: read a CSV file
<a href="#Overview">Return to overview</a>

Just to refresh our memories from last week, go ahead and read in the first data file `data/1001_Aud_N0-Back_run1.csv` as a dataframe and print it. 

One nice feature about Jupyter Notebooks is that the output of the last line is always displayed below the cell. By default, if you put a variable name by itself on the last line, Jupyter Notebook will display a nicely-formatted representation of the variable.

In [None]:
%load "answers/answer_001.txt"

Nice!  But we seem to have two problems here.  First, the dataframe we've created doesn't have a lot of important information about the file that is contained in the filename including the subject ID, testing modality, N-back condition, and run number. Second, we have a lot of data files here, and it would be tedious to try to read them all in separately.

Let's tackle the first problem. How can we parse the information stored in the filename? The format of the filename is:

    subject_modality_condition_run.csv
    
For `1001_Aud_N0-Back_run1.csv`, this gives us:

* subject = 1001
* modality = Aud
* condition = N0-Back
* run = run1

Since our data is stored in a folder `data`, we need to specify the path to the file relative to the folder the notebook is stored in (i.e., `data/1001_Aud_N0-Back_run1.csv`).

We'll use the **`basename`** function (available via the `os.path` module) to get only the filename from the path to the file and the **`split`** method (available on string variables) to split the string up by a particular character. The important variables in our filename are conveniently separated by underscores, so let's use this to split the filename. Run the following code:

In [None]:
pathname = 'data/1001_Aud_N0-Back_run1.csv'
filename = os.path.basename(pathname)
filename

In [None]:
split = filename.split('_')
split

Great!!  Now we have a convenient list that contains each of the variables we need to include in our dataframe.  We can take this one step further and assign a name to each of these values using a technique known as unpacking. Using this trick, we can take a `list` or `tuple` of N elements and assign its values to the same number of variables on the left-hand side of an assignment, e.g.:

    value = ('A', 'B', 'C')
    value1, value2, value3 = value

It's a nice, short-hand way of doing the following:

    value1 = value[0]
    value2 = value[1]
    value3 = value[2]
    
Will the following work?

    value = ('A', 'B', 'C')
    value1, value2 = value
    
What about this?

    value = ('A', 'B')
    value1, value2, value3 = value
    
And, as a bonus trick question, what do you get?

    value = ['A']
    value1 = value
    
What if we do this instead?

    value = ['A']
    value1, = value
    
Why?
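
Once you've made your predictions, one way to check them is a small sketch like the one below. It wraps the two mismatched cases in `try`/`except` so the cell keeps running. The names `too_many`, `too_few`, and `whole_list` are made up for this illustration.

```python
# Unpacking needs exactly as many names on the left as there are items on the right.
value = ('A', 'B', 'C')

try:
    value1, value2 = value               # three items, only two names
except ValueError as error:
    too_many = str(error)
    print(too_many)

try:
    value1, value2, value3 = ('A', 'B')  # two items, three names
except ValueError as error:
    too_few = str(error)
    print(too_few)

# Without a trailing comma this is plain assignment: value1 refers to the whole list.
whole_list = ['A']
value1 = whole_list
print(value1)

# With a trailing comma the left-hand side is a one-element tuple,
# so the list is unpacked and value1 is the single item inside it.
value1, = whole_list
print(value1)
```

The trailing comma is what tells Python that the left-hand side is a tuple of names rather than a single name.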

In [None]:
subject, modality, condition, run  = filename.split('_')
condition

Perfect!!  This also highlights the importance of carefully selecting how you name your datafiles, and also keeping them consistent.  Considering how to format your filename and being consistent in using this format will make it MUCH easier to work with the data.

Now let's tackle our second problem which is that we have a lot of data files to read in.  There's an easier way to read them into a dataframe rather than reading them all separately and merging the dataframes. Let's use the **`glob`** function to get a list of all the files we need to read in and combine it with a **`for`** loop to iterate through the list of files. The **`append`** method will save the result of each iteration to a list.

Go ahead and run the code below and see what you get:

In [None]:
# The '*' wildcard tells the glob function to match any sequence of characters, which
# gives us all filenames in the data folder ending in '.csv'
glob('data/*.csv')

In [None]:
myList = []
print(myList)
myList.append(1)
print(myList)
myList.append('one')
print(myList)

In [None]:
datasets = []
for path in glob('data/*.csv'): 
    dataset = pd.read_csv(path)
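    # drop the '.csv' extension so the run column holds 'run1' rather than 'run1.csv'
    file = os.path.splitext(os.path.basename(path))[0]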
    subject, modality, condition, run = file.split('_')
    #recall that this syntax is how we assign new columns
    dataset['subject'] = subject 
    dataset['modality'] = modality
    dataset['condition'] = condition
    dataset['run'] = run
    datasets.append(dataset)
    
# Show only the first three elements in the list
datasets[:3]

On the upside, we've now got all our data into Python, including columns for subject ID, modality, condition, and run.  

But Ugh!! This list of dataframes (one dataframe per file) is not a very accessible way to work with the data.

Even though we used pandas to read in each data file, we appended each resulting dataframe to a list that we called *`datasets`*.  So now we just need to use another pandas function, **`concat`**, to assemble the list of dataframes into one nice single dataframe.  Run the code below and see what you get: 

In [None]:
full_data = pd.concat(datasets)
full_data.head()

This looks a lot better!!  But there's one additional thing that could really help us out when we start analyzing this data, and that's adding one more column that specifies the group of each participant.  In this data set, subject IDs in the 1000s indicate control participants while subject IDs in the 2000s indicate the experimental group. So, let's make a new column in our dataframe called 'group' and assign a value based upon whether or not the subject ID is less than or equal to 2000.  Run the following code:

In [None]:
full_data['group'] = full_data['subject'] <= 2000
full_data

Wait! This didn't work! It's because when we called `file.split('_')`, it returned a list of strings. In other words, the subject ID column contains strings, not integers. We're trying to compare a string with an integer, e.g.:

    '1013' <= 2000
    
You can see for yourself by inspecting the `dtypes` attribute on `full_data` which tells you the data type of each column (note that `object` *often* means the column is a string).

In [None]:
full_data.dtypes

That won't work. First, we need to convert the subject column from a string to an integer. Fortunately, it's as simple as using the `astype()` method.

In [None]:
full_data['subject'] = full_data['subject'].astype('int')
full_data.dtypes
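
To see the same idea in isolation, here is a minimal sketch using two made-up subject IDs (the `ids` Series and the other variable names are hypothetical and not part of our data set). It shows that comparing a string with an integer fails in plain Python, and that the comparison works element by element once the strings are converted with `astype()`.

```python
import pandas as pd

# Made-up subject IDs stored as strings, as they are after splitting a filename
ids = pd.Series(['1013', '2004'])

# In plain Python, comparing a string with an integer raises a TypeError
try:
    '1013' <= 2000
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)

# After converting to integers, the comparison works element by element
ids_as_int = ids.astype('int')
is_control = ids_as_int <= 2000
print(is_control.tolist())
```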

Now, let's try again.

In [None]:
full_data['group'] = full_data['subject'] <= 2000
full_data.head()

Notice that our new 'group' column is populated with `True` and `False`.  This is because when we use the `<=` operator, we are essentially asking "is this value less than or equal to 2000?" for each row.  It returns `True` if the value is less than or equal to 2000 and `False` otherwise. But it'd be nice to have the group labels be more meaningful. Let's assign names.

In [None]:
# rename groups
full_data['group'] = full_data['group'].replace({True: 'Control', False: 'TBI'})
full_data.head()

<a name="#Selecting-and-averaging-subsets-of-data"></a>
# Selecting and averaging subsets of data
<a href="#Overview">Return to overview</a>

<a name="#Exercise-2:-using-masks"></a>
## Exercise 2: using masks
<a href="#Overview">Return to overview</a>
This exercise involves extracting the specific parts of this data set that we're particularly interested in. First, we'll want to pull out all trials where `Type` is 10, as this was the target stimulus whose responses we want to analyze.

Let's go through an example where we will use a mask approach to filter the data by `Type`:



In [None]:
mask = full_data['Type'] == 10
mask

This creates a boolean Series with one `True`/`False` value per row. Just to refresh our memories, let's look at our `full_data` dataframe again to see which `Type` values appear in the first few rows: 

In [None]:
full_data.head()  # the head() method is just an easy way to list the first few rows of the dataframe

It looks like our mask worked. We can see that the first two rows of data, which were *not* type = 10, are coded as `False`, while the third row of data that was type = 10 is correctly coded as `True`. Neat!

But how do we actually get this data into a usable form? 

Remember the `.loc` attribute from last week?  Here's a little reminder:

    value = dataframe.loc[row_label, column_label]

`.loc` allows us to extract the data of interest. Try using `.loc` with your `mask` to extract the matching rows and display them. Label your new dataframe `type_10`. 
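
If the mask-plus-`.loc` pattern feels abstract, the sketch below walks through it on a tiny made-up dataframe. The `toy` data and the names `toy_mask` and `toy_targets` are invented for illustration and are not the real N-Back data. The same two steps should carry over to `full_data`.

```python
import pandas as pd

# Made-up trials for illustration only; not the real N-Back data
toy = pd.DataFrame({
    'Type': [1, 2, 10, 10, 3],
    'Latency': [0, 0, 512, 430, 0],
})

# Step 1: build a boolean mask (one True/False value per row)
toy_mask = toy['Type'] == 10

# Step 2: pass the mask to .loc as the row selector to keep only the True rows
toy_targets = toy.loc[toy_mask]
print(toy_targets)
```

Because only a row selector is given, `.loc` keeps every column for the selected rows.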

In [None]:
%load "answers/answer_002.txt"

Great! Now we have a new `dataframe` called 'type_10' with rows that only contain data from our target stimulus.

<a name="#Exercise-3:-using-groupby"></a>
## Exercise 3: using groupby
<a href="#Overview">Return to overview</a>


Next, let's figure out how to calculate the percent correct for each participant, modality, and condition. Because the `Correct` column is coded as 0 = Incorrect and 1 = Correct, the mean of that column is the fraction of correct responses, so we can simply take the mean once we have grouped the data according to participant, modality, and condition.

We learned about the `.groupby` method last week. Here's a little reminder:

    variable_name = dataframe.groupby(['column_name_1'])
    
You can specify more than one column name, e.g.:

    variable_name = dataframe.groupby(['column_name_1', 'column_name_2', ...])
    
We can also perform operations on the grouped data with syntax like this: 

    variable_name['column_name'].mean()

We can use a combination of these methods to specify that we want our data grouped by `subject`, `modality`, and `condition` and that we want to take the mean of `Correct` across these columns of interest.

Take a stab at setting this up and call your new variable `percent_correct`. Don't forget to multiply the mean by 100 to convert from fraction correct to percent correct:
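
As a warm-up, here is a minimal sketch of the same grouping pattern on a made-up dataframe. The `toy` data and `toy_percent` are hypothetical, and the sketch groups by only two columns to keep it short. It shows how the mean of a 0/1 column becomes a percentage.

```python
import pandas as pd

# Made-up responses for illustration only: 1 = correct, 0 = incorrect
toy = pd.DataFrame({
    'subject': [1, 1, 1, 1, 2, 2, 2, 2],
    'condition': ['N0', 'N0', 'N2', 'N2', 'N0', 'N0', 'N2', 'N2'],
    'Correct': [1, 1, 1, 0, 1, 1, 0, 0],
})

# Group the rows, select the column of interest, average it, and scale to percent
toy_percent = toy.groupby(['subject', 'condition'])['Correct'].mean() * 100
print(toy_percent)
```

The result is a Series indexed by every subject/condition combination found in the data.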

In [None]:
%load "answers/answer_003.txt"

Cool! Now we have percent correct data for each subject, modality, and condition.  And this snippet of data seems to indicate that as the task gets harder from the N0-Back condition to the N2-Back condition, performance becomes poorer. Just what we expected - phew!

<a name="#Exercise-4:-more-masks"></a>
## Exercise 4: more masks
<a href="#Overview">Return to overview</a>

Having the correct/incorrect information coded as 0s and 1s was really handy for calculating percent correct. But in terms of analyzing the reaction time (latency) of the response, we are only interested in analyzing trials that the participant got correct. Try using a mask and the `.loc` attribute again to subset the data further and store the result in a new variable called `type_10_correct`.

In [None]:
%load "answers/answer_004.txt"

<a name="#Exercise-5:-more-groupby"></a>
## Exercise 5: more groupby
<a href="#Overview">Return to overview</a>


Now that we have a dataframe that only contains rows of trials that were answered correctly, let's use the `.groupby` method again to calculate the average latency for each participant, modality, and condition. Call your new variable `mean_latency`.

In [None]:
%load "answers/answer_005.txt"

Perfect! At first glance, it looks like our average latency data is following a similar trend as our percent correct data: as the condition gets more difficult, the reaction times become longer. But wouldn't it be easier to compare if we plotted it?

In [None]:
from matplotlib import pyplot as plt

<a name="#Plotting-the-data"></a>
# Plotting the data
<a href="#Overview">Return to overview</a>

<a name="#Exercise-6:-plotting-mean-latency-by-condition"></a>
## Exercise 6: plotting mean latency by condition
<a href="#Overview">Return to overview</a>

Remember our original research questions:
- Do participants show the expected reduction in accuracy and increase in latency with increasing n?
- Is there a difference in performance accuracy or speed according to sensory modality?
- Is there a difference in accuracy and speed between participant groups, especially for auditory versus visual modalities?

Let's look at the latency data first.  Recall from last week that we can use the `.plot()` method to make all sorts of figures.  To answer our first question, we want to look at latency as a function of condition.  So, go ahead and calculate the mean latency grouped by condition only, and then plot this information in a bar plot.  

In [None]:
%load "answers/answer_006.txt"

Hey, it does look like latency increases as the task gets more difficult!  Yay! But let's add some standard error bars to this figure to see if the differences across conditions look significant.  

Next, find a method to calculate the standard error of the mean latency after grouping by `condition`.  Call this new variable `sem_grouped_latency`, and print the result.<br>

**Hint!** recall that you can type "variable_name." and then press **Tab** to figure out what methods are available for that variable!
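
If you'd like to check what the standard error of the mean actually is, the sketch below compares the built-in `sem()` method on a grouped column with a by-hand calculation (sample standard deviation divided by the square root of the number of observations). The `toy` latencies and the variable names are made up for illustration.

```python
import pandas as pd

# Made-up latencies for illustration only
toy = pd.DataFrame({
    'condition': ['N0', 'N0', 'N0', 'N2', 'N2', 'N2'],
    'Latency': [400, 420, 440, 500, 560, 620],
})
grouped = toy.groupby('condition')['Latency']

# Built-in standard error of the mean for each group
toy_sem = grouped.sem()

# The same quantity by hand: sample standard deviation divided by sqrt(n)
toy_sem_by_hand = grouped.std() / grouped.count() ** 0.5

print(toy_sem)
print(toy_sem_by_hand)
```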

In [None]:
%load "answers/answer_007.txt"

Finally, let's replot these data but include error bars this time.  Explore your options for `plot` and try to figure out which argument you need to specify the error bars, and then plot the result.  

In [None]:
%load "answers/answer_008.txt"

Wow, those are some tiny error bars!  But it looks like we can definitely answer our first question: YES!  Participants do show the expected increase in reaction time with increasing task difficulty.

<a name="#Exercise-7:-plotting-mean-latency-by-modality"></a>
## Exercise 7: plotting mean latency by modality
<a href="#Overview">Return to overview</a>

Now, let's apply what we just learned and plot the average latency grouped by modality this time, including error bars.

In [None]:
%load "answers/answer_009.txt"

Nice!!  This graph answers our second question: YES, there does appear to be a difference between the two modalities such that participants are faster to respond in the visual modality compared to the auditory modality.
<br>

<a name="#Exercise-8:-plotting-mean-latency-by-group,-modality-and-condition"></a>
## Exercise 8: plotting mean latency by group, modality and condition
<a href="#Overview">Return to overview</a>

Now, to answer our final question, we need to do some additional grouping.  Here, we want to compare mean latencies across group, modality, and condition.  Let's go ahead and calculate this grouped mean before we get to plotting.  Call the new variable `grouped_latency` and print the result.   

In [None]:
%load "answers/answer_010.txt"

So far, so good.  But now that we're grouping by multiple variables, the data are no longer in a format where the `plot` method can just accept the data and know what to do with it.  Rather, we need to pivot the `condition` level of the row index into columns.  We learned how to do this last week!  <br>
**Hint!**  The method you need has seven letters, starts with a u and ends with a k!
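
As a reminder of what this reshaping does, here is a minimal sketch on made-up numbers. The `toy` data, `toy_grouped`, and `toy_wide` are hypothetical. One level of the grouped index is moved into the columns, so each condition gets its own column.

```python
import pandas as pd

# Made-up latencies for illustration only
toy = pd.DataFrame({
    'modality': ['Aud', 'Aud', 'Vis', 'Vis'],
    'condition': ['N0', 'N2', 'N0', 'N2'],
    'Latency': [400, 600, 380, 550],
})

# Grouping by two columns gives a Series with a two-level row index
toy_grouped = toy.groupby(['modality', 'condition'])['Latency'].mean()

# unstack moves the 'condition' level out of the row index and into the columns
toy_wide = toy_grouped.unstack('condition')
print(toy_wide)
```

In this wide layout, a bar plot draws one group of bars per row and one bar per column.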

In [None]:
%load "answers/answer_011.txt"

OK, the data is now in the correct format for plotting.  Now, let's calculate the SEM for these data and again get it into a form that we can use for plotting, just as we did above. 

In [None]:
%load "answers/answer_012.txt"

Now, let's put it all together and plot the average latency grouped by group, condition, and modality including the standard error bars! 

In [None]:
%load "answers/answer_013.txt"

Next, let's try to move that legend out of the way so that we can see our whole figure. This is a little tricky, but here is the code to move the legend.  Spend a little time playing with the various components to see how they change the location of your figure legend. The key arguments are:

* `loc`: The corner of the legend box that's anchored to the coordinates specified by `bbox_to_anchor`.
* `bbox_to_anchor`: The axes coordinates of the legend box corner specified by `loc`. Axes coordinates run from 0 (minimum along the X or Y axes) to 1 (maximum along the X or Y axes). So, specifying XY coordinates of X=1.05 and Y=1 positions the corner just outside the axes themselves.

In [None]:
axes = grouped_latency.unstack('condition').plot(kind='bar', yerr=sem_grouped_latency_unstacked)
axes.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)

Interesting!!  Both groups look pretty similar on the easiest N0-back condition.  But it looks like the TBI group is a little slower than the controls for the N1 and N2 back condition in the auditory modality, while they may be a little *faster* than the control group in the hardest condition of the visual task.  

<a name="#Exercise-9:-plotting-percent-correct-by-group,-modality-and-condition"></a>
## Exercise 9: plotting percent correct by group, modality and condition
<a href="#Overview">Return to overview</a>

Recall that we're interested in ability to do the test (percent correct) in addition to reaction times.  So, finally, let's make one more plot just like this last one but using percent correct data.  Remember: we want both correct and incorrect responses here, so make sure you're using the `type_10` dataframe and plotting the `Correct` column. You can copy/paste the code we used above and change the appropriate column names and variable names.

In [None]:
%load "answers/answer_014.txt"

It looks like the groups are fairly comparable on most conditions except for the N2-Back.  For this condition, in both auditory and visual modalities, it looks like our TBI folks are performing far below the controls.  