# Tutorial 09, Pandas, series, data frames 
* data structures and data analysis tools

[The official project homepage](https://pandas.pydata.org)

## Basic data structures - start with Series then build up to DataFrames

[Pandas quick start guide for Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)

* A **Series** is a 1D array that can hold any type of data (numeric types, non-numeric, Python objects and so forth).
    * Unlike a 1D NumPy array, each entry is **labeled** with an index that keeps track of what each entry is, and that can be used to look up the corresponding value during analysis (remember dictionaries?)
    * These labels are fixed - they will always index the same value unless you explicitly break that link.
    * The list of labels that forms the index can either be declared upon series creation or, by default, it will range from 0 to len(data)-1.
        * If you're going to use Pandas to organize your data, specifying usable and informative labels is a good idea, because labeled access is one of the main advantages of organizing your data this way. If you just want to fly blind, then NumPy is usually fine on its own
        

Pandas will allow you to specify non-unique labels. This can be OK for operations that don't rely on indexing by label. However, operations that do rely on unique labels for indexing may throw an unexpected error, so in general it's good practice to use unique labels!
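
As a rough illustration of the point above (the Series `dup` and its labels are made up for this sketch), selecting by a duplicated label returns every matching entry, while an operation that needs unique labels, such as `reindex`, raises an error:

```python
import pandas as pd

# hypothetical Series with the label 'a' used twice
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# operations that don't depend on labels work as usual
total = dup.sum()

# indexing by a duplicated label returns a Series with both matches
both = dup['a']
print(both)

# reindexing needs unique labels, so pandas raises a ValueError here
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError as err:
    reindex_failed = True
    print('reindex failed:', err)

# quick check you can run before relying on label-based operations
print('labels unique?', dup.index.is_unique)
```

Checking `index.is_unique` right after building a Series is a cheap way to catch this kind of problem early.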


## Import libs

In [0]:
# standard numpy module
import numpy as np

# import a generic pandas object and also a few specific functions that we'll use
import pandas as pd 

# new - get and store current file path for file i/o later on in tutorial
import os
cwd = os.getcwd()

## Create a series from an numpy ndarray

In [2]:
# make some data and then use pd.Series

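# seed the global random number generator so we get the same thing each time
# (np.random.RandomState(0) on its own only creates a separate generator object)
np.random.seed(0)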

# For this simulation, let's have 12 subjects, and some data
# generated from a standard normal distribution (np.random.randn)
N = 12
data = np.random.randn(N)

# make a list of subject names for use as index labels
label_prefix = 'Sub'
index=[]
for n in np.arange(N):
    index.append(label_prefix+str(n))    
    
# print our list of index labels
print('Index labels: ', index, '\n')

# then make our pandas Series by passing in our data array and our index labels
s = pd.Series(data, index=index)
print(s)

Index labels:  ['Sub0', 'Sub1', 'Sub2', 'Sub3', 'Sub4', 'Sub5', 'Sub6', 'Sub7', 'Sub8', 'Sub9', 'Sub10', 'Sub11'] 

Sub0    -1.010091
Sub1    -0.393058
Sub2     0.962369
Sub3     0.133389
Sub4    -0.125808
Sub5    -0.336541
Sub6     0.874590
Sub7    -0.429303
Sub8     1.441105
Sub9     0.982347
Sub10   -0.250800
Sub11    0.167832
dtype: float64
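
The loop that builds the index labels can also be written more compactly. This is only a sketch of one alternative (a list comprehension with an f-string), using the same `label_prefix` and `N` as above:

```python
label_prefix = 'Sub'
N = 12

# build 'Sub0', 'Sub1', ... 'Sub11' in one line
index = [f'{label_prefix}{n}' for n in range(N)]
print(index)
```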


## Note that each subject is now a field in the series and can be used to retrieve the corresponding value...there are a few ways to do this

In [3]:
# access by field
print(s.Sub11)

# access by index label
print(s['Sub11'])

# will cover more advanced slicing below

0.16783160573851882
0.16783160573851882
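
One caveat worth sketching here: access by field only works when the label is a valid Python name that does not clash with an existing Series attribute or method. The labels below are hypothetical and chosen to show the clash:

```python
import pandas as pd

# hypothetical labels: 'mean' collides with the Series.mean method
s_demo = pd.Series([1.0, 2.0], index=['Sub0', 'mean'])

by_attr = s_demo.Sub0                   # attribute access works for a plain label
by_label = s_demo['mean']               # bracket access always refers to the label
attr_is_method = callable(s_demo.mean)  # s_demo.mean is the method, not the stored value

print(by_attr, by_label, attr_is_method)
```

Bracket access is the safer habit when labels come from data you don't control.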


## Can also use labels to check for membership or to index over labels

In [4]:
# check for membership
print('Sub11' in s)

# iterate over index labels, with i == index label
for i in s.index:
    print(i)


# iterate over values...
for v in s.values:
    print(v)    

True
Sub0
Sub1
Sub2
Sub3
Sub4
Sub5
Sub6
Sub7
Sub8
Sub9
Sub10
Sub11
-1.0100914219211745
-0.39305842867805213
0.9623687873760113
0.13338864581868098
-0.125807930572318
-0.33654079660214037
0.8745895903221298
-0.4293032850066475
1.441105470835865
0.9823473898003291
-0.2507999008035929
0.16783160573851882


In [5]:
# can also get to the values more directly like this:
for d in s:
    print(d)

-1.0100914219211745
-0.39305842867805213
0.9623687873760113
0.13338864581868098
-0.125807930572318
-0.33654079660214037
0.8745895903221298
-0.4293032850066475
1.441105470835865
0.9823473898003291
-0.2507999008035929
0.16783160573851882


## Before moving on, there are a few other optional (but important) parameters of the pd.Series call
* dtype - default is to infer the data type (int32, float64, str, etc) based on the values in data
    * However, you can also explicitly declare the data type
    * This can be good if you want to, for example, re-cast the data to save space or to make types compatible
    * But this may also have important negative consequences if not done thoughtfully! 
* copy - if not specified then the default behavior is set to False and the new series will have a 'view' of the data.
    * This can save space, but can sometimes lead to confusion as any change to the values in s will also change the values in the original 'data' array
    * Setting copy=True will make a new copy of the data in 's' that is independent of the input 'data' array


### Explicitly declare a different dtype to see where things can go wrong

In [6]:
# make a series with the data array from above, but make it int32 instead of the inferred (and correct) float64 type
s = pd.Series(data, index=index, dtype='int32')

# first 4 values in our original data array
print(data[:4])

# first 4 values in our series of type int32...might not be what you want!
print('\n', s[:4])

[-1.01009142 -0.39305843  0.96236879  0.13338865]

 Sub0   -1
Sub1    0
Sub2    0
Sub3    0
dtype: int32
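
If integers are what you actually want, one option is to round before casting. The sketch below is an illustration only: it copies the first four printed values by hand and uses `astype` on an existing float Series rather than the `dtype` argument:

```python
import numpy as np
import pandas as pd

# first four values from the printed array above, copied by hand
vals = np.array([-1.01009142, -0.39305843, 0.96236879, 0.13338865])
s_float = pd.Series(vals, index=['Sub0', 'Sub1', 'Sub2', 'Sub3'])

truncated = s_float.astype('int32')        # drops the fractional part
rounded = s_float.round().astype('int32')  # rounds to the nearest integer first

print(truncated)
print(rounded)
```

Note how `Sub2` (0.96) becomes 0 when truncated but 1 when rounded first.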


### Another example: declaring dtype can be handy if you want to, for example, do str manipulations with the data array later or if you want to merge with another series of type str
* Note that the dtype of series 's' is now 'object'. This is the generic Pandas dtype that is used here to hold Python 'str' values


In [7]:
# make a series with the data array from above, but this time make it a str
# instead of the inferred float64 type
s = pd.Series(data, index=index, dtype='str')

# first 4 values in our original data array
print(data[:4])

# first 4 values in our series of type str...preserves info and we're now
# all set to do a bunch of str operation without having to deal with 
# recasting each time we interact with the values in s
print('\n', s[:4])

[-1.01009142 -0.39305843  0.96236879  0.13338865]

 Sub0    -1.01009
Sub1   -0.393058
Sub2    0.962369
Sub3    0.133389
dtype: object


### Default on generating new series is to make a view of the data (copy=False)

In [8]:
# Same as before - create a series based on a short data array (0:4 in this case for simplicity)
# let Pandas figure out the dtype, and use the default copy behavior (i.e. copy=False)

N = 4                # number of data points

# make data
data = np.arange(N)

# make index labels
index = ['d1','d2','d3','d4']

# print out the original data array for reference
print('Original data: ', data, '\n')

# make a series with the default behavior of copy=False
s = pd.Series(data, index=index, copy=False)

# print out the new series
print('Original values in series')
print(s)

# now change the value of the first entry in the series
s['d1'] = 100

# new values in series 's'
print('\nNew values in series')
print(s)

# and then print the corresponding entry in the data array
print('\nNew data:', data, '\ndata[0] changed too!')

# Note that data[0] changed because the values in s are a view of data...
# both are referencing the same chunk of memory

Original data:  [0 1 2 3] 

Original values in series
d1    0
d2    1
d3    2
d4    3
dtype: int64

New values in series
d1    100
d2      1
d3      2
d4      3
dtype: int64

New data: [100   1   2   3] 
data[0] changed too!


<div class="alert alert-danger">
Note that this works in the other direction too, which can be more insidious...if you create a Series based on the values in 'data', and then do more work with 'data', then every time you change a value in the original data array, you will also change the corresponding value in s!!!
</div>

In [9]:
# now do the same thing but this time lets explicitly ask for a copy of the data
N = 4                # number of data points

# make data
data = np.arange(N)

# make index labels
index = ['d1','d2','d3','d4']

# print out the original data array for reference
print('Original data: ', data, '\n')

# make a series, but change the default behavior of copy to copy=True
s = pd.Series(data, index=index, copy=True)

# print out the new series
print('Original values in series')
print(s)

# now change the value of the first entry in the series
s['d1'] = 100

# new values in series 's'
print('\nNew values in series')
print(s)

# and then print the corresponding entry in the data array
print('\nNew data is the same as the old data:', data)
print('data[0] did not change because it is independent from values in s')

Original data:  [0 1 2 3] 

Original values in series
d1    0
d2    1
d3    2
d4    3
dtype: int64

New values in series
d1    100
d2      1
d3      2
d4      3
dtype: int64

New data is the same as the old data: [0 1 2 3]
data[0] did not change because it is independent from values in s


## After creating a pandas series, you can do many common operations and access the functionality of other modules 
* A pd Series behaves similarly to a NumPy ndarray, and can be passed to many NumPy functions
* Slicing also works like an ndarray - note that the index is also sliced
* Lots of built in methods as well that emulate NumPy functionality

### Can pass pd.Series to most NumPy functions... 

In [41]:
# make a new series...
N = 8
data = np.random.exponential(size=N)

# make some labels
label_prefix = 'Exp'

index=[]
for n in np.arange(N):
    index.append(label_prefix+str(n))
    
# make the series
s = pd.Series(data, index=index)

# can pass s to common np operations...
print(s)
print('\nMean: ', np.mean(s), 'Max: ', np.max(s))


Exp0    0.921084
Exp1    0.702152
Exp2    1.782979
Exp3    2.779827
Exp4    0.375859
Exp5    3.462235
Exp6    0.238338
Exp7    0.711948
dtype: float64

Mean:  1.3718025464275176 Max:  3.462234938466547


### Note that the index labels come along for the ride 

In [11]:
# print our series - set of index labels along with data values
print(s)

# then apply the NumPy cumulative product operation (multiply entry 0 by entry 1, then that result by entry 2, etc)
cp = np.cumprod(s)

print('\nCumproduct\n')
print(cp)

# cool part: note that the output also contains the label info, which is handy to keep track of things,
# e.g. you can index into cp using these labels
print('\nIndex by label')
print(cp['Exp6'])
print(cp.Exp6)

Exp0    1.579368
Exp1    0.241303
Exp2    0.065250
Exp3    0.982524
Exp4    3.217626
Exp5    1.209952
Exp6    0.489658
Exp7    1.777540
dtype: float64

Cumproduct

Exp0    1.579368
Exp1    0.381107
Exp2    0.024867
Exp3    0.024433
Exp4    0.078616
Exp5    0.095121
Exp6    0.046577
Exp7    0.082792
dtype: float64

Index by label
0.046576808093731346
0.046576808093731346


### Series objects have many built in operations, much like NumPy 
[list of attributes and methods](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html)

In [12]:
# attributes
print('Data Type: ', s.dtype)

# basic methods
print('Mean: ', s.mean(), ' Std:', s.std(), 'Max: ', s.max())

# first discrete difference between consecutive entries (a simple numerical derivative)
print('Diff: ', s.diff())

Data Type:  float64
Mean:  1.195402662022865  Std: 1.0216831164483673 Max:  3.217625958850253
Diff:  Exp0         NaN
Exp1   -1.338064
Exp2   -0.176053
Exp3    0.917273
Exp4    2.235102
Exp5   -2.007674
Exp6   -0.720294
Exp7    1.287882
dtype: float64


### Slicing also works like NumPy

In [43]:
# print the series
print(s)
print('\n')

# first 3 values
print('First 3 entries')
print(s[:3])
print('\n')

# another example using a start index and a negative stop index...
another_slice = s[3:-1]    # 4th entry up to, but not including, the last entry
print(another_slice)

# another example using a boolean mask...
yet_another_slice = s[s>.5]    # all entries greater than .5
print('\n', yet_another_slice)

Exp0    0.921084
Exp1    0.702152
Exp2    1.782979
Exp3    2.779827
Exp4    0.375859
Exp5    3.462235
Exp6    0.238338
Exp7    0.711948
dtype: float64


First 3 entries
Exp0    0.921084
Exp1    0.702152
Exp2    1.782979
dtype: float64


Exp3    2.779827
Exp4    0.375859
Exp5    3.462235
Exp6    0.238338
dtype: float64

 Exp0    0.921084
Exp1    0.702152
Exp2    1.782979
Exp3    2.779827
Exp5    3.462235
Exp7    0.711948
dtype: float64


The fact that labels stay attached to the corresponding values is often useful because you don't have to compute and store a separate index for the new data set like you would in Matlab if you wanted to keep track of where the values > .5 were in the original array.

## Although series can be treated much like NumPy arrays, there is one key difference (and often a big advantage)
* When you do an operation on two NumPy arrays, the operation is performed in an element-by-element manner, based on position
* However, when you do an operation on two pandas series, the operation will be applied to like-labeled values
* This can save a lot of trouble in terms of lining up corresponding entries in two data arrays when the data sets are initialized in different orders!

In [14]:
# first a quick demo in NumPy just to get re-familiarized with how it works
# make two arrays, and add them
N=5
x = np.arange(N)
y = np.linspace(0,N-1,N, dtype='int32')+10
print(x)
print(y)
print(x+y)

[0 1 2 3 4]
[10 11 12 13 14]
[10 12 14 16 18]


The next part is neat and really really useful in many real world applications where data sets are messy...Series operations are performed based on matching labels, not on matching positions in an array!


### Following on the NumPy example in the last cell...Now suppose that you ran a set of subjects in two experiments, but the data from each subject were entered in a different order in each study
* Even though the data were entered in different orders, you want an easy way to perform operations on specific subjects across experiments 
* Using NumPy - or Matlab - you'd probably now try to sort your second data set so that the labels from the second study were in the same order as in the first study.
* Then you would save an index indicating the sort order, and you'd use that index to rearrange the data values from the second data set so that everything lined up with the first data set.
* A series can make life much easier here because operations are done on a union of the labels involved!

In [44]:
# set up two series - as if we have two data sets from the same set of 5 participants
N=5
data0 = np.arange(N)
index0 = ['s0','s1','s2','s3','s4']
s0 = pd.Series(data0, index=index0)

# now do our second 'experiment' but this time the subjects were run in a different order
data1 = np.arange(N)+7
index1 = ['s3','s2','s4','s1','s0']
s1 = pd.Series(data1, index=index1)

# print out our data series
print(s0)
print(s1)

s0    0
s1    1
s2    2
s3    3
s4    4
dtype: int64
s3     7
s2     8
s4     9
s1    10
s0    11
dtype: int64


In [16]:
# Do a simple binary operation like addition across data sets
sum_data = s0+s1
print(sum_data)
# Even though the numerical position of each subject differs across experiments, Pandas figured out how 
# to properly perform the operation by aligning based on index labels!

s0    11
s1    11
s2    10
s3    10
s4    13
dtype: int64
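
Because alignment uses the union of the labels, a label that appears in only one Series produces a NaN in the result. The following is a small sketch with made-up subject labels; `Series.add` with `fill_value` is one way to treat the missing side as zero instead:

```python
import pandas as pd

# hypothetical data: 's0' is only in a, 's3' is only in b
a = pd.Series([1, 2, 3], index=['s0', 's1', 's2'])
b = pd.Series([10, 20, 30], index=['s1', 's2', 's3'])

# labels missing from either side become NaN (and the result is upcast to float)
plain_sum = a + b
print(plain_sum)

# treat a missing entry as 0 instead of producing NaN
filled_sum = a.add(b, fill_value=0)
print(filled_sum)
```

Whether filling with zero is appropriate depends on the analysis; a NaN is often the more honest answer for a subject who was not run in one of the experiments.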


## Last notes on creation of series...
* Thus far we've been initializing series with ndarrays
* Can also make series from scalars (assign all indices same value) or from dicts

In [18]:
# series from scalars
N=4

# no need to repeat the value because it's a single scalar linked to each index
data = 14
index = np.arange(N) 

# make the series
s = pd.Series(data, index=index)

# all entries will have the same value
s

0    14
1    14
2    14
3    14
dtype: int64

### Can also initialize with a dict
* dict keys become index labels
* data become values

In [19]:
data = {'Bob' : 20, 'Ella' : 17, 'Sam' : 23, 'Jack' : 25.3}
s = pd.Series(data)
print(s)

Bob     20.0
Ella    17.0
Jack    25.3
Sam     23.0
dtype: float64


<div class="alert alert-info">
Note that the data type is upcast to that of the highest-precision entry when you create a Series with mixed numerical data types
</div>
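
A quick way to see the upcasting is to compare the inferred dtype for a few small dicts. The names and values below are illustrative only, including the `'n/a'` string used to show what happens with a non-numeric entry:

```python
import pandas as pd

ints_only = pd.Series({'Bob': 20, 'Ella': 17})         # all integers
mixed_numeric = pd.Series({'Bob': 20, 'Jack': 25.3})   # int + float -> float64
mixed_with_str = pd.Series({'Bob': 20, 'Jack': 'n/a'}) # int + str -> object

print(ints_only.dtype, mixed_numeric.dtype, mixed_with_str.dtype)
```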

# Pandas DataFrames 
[The official project homepage](https://pandas.pydata.org)

* Goal
    * Extend what we learned about Series objects in the previous tutorial to their 2D counterpart - DataFrames
    * Develop some tools for dealing with missing data (not exhaustive, but a good start)
    * Take this chance to also learn a bit about file input/output (I/O) and some other more advanced coding techniques

## DataFrames

[Pandas quick start guide for DataFrames](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)

* A DataFrame (DF) is a labeled data structure that can be thought of as a 2D extension of the Series objects that we discussed in the first part of the tutorial
* A DF can accept many types of input, from a 2D ndarray, multiple Series, a dict of 1D arrays, another DF, etc
* Like a Series, DFs contain data values and their labels. Because we're now dealing with a 2D structure, the row labels are passed with the **index** argument and the column labels with the **columns** argument. 
    * Like a Series, if you don't explicitly assign row and column labels, then they will be auto-generated (but not as useful as specifying the labels yourself!)

<div class="alert alert-info">
Much of what we learned about Series objects will generalize to DFs, so here we'll focus on some of the key functionality that might not be obvious based on the first part of the tutorial.
</div>

<div class="alert alert-info">
One more quick note: if using an older version of Python (earlier than 3.6) and Pandas (earlier than 0.23) and you create a DF from a dict without explicitly specifying column names, then the column names will be entered into the DF based on lexical order
</div>

## Import libs

In [0]:
# standard numpy module
import numpy as np

# import a generic pandas object and also a few specific functions that we'll use
import pandas as pd 
from pandas import DataFrame, read_csv

# new - get and store current file path for file i/o later on in tutorial
import os
cwd = os.getcwd()

## Make up a data set to demonstrate functionality, will import some real data later on
* Here we'll pretend that we did a unit recording experiment (i.e. recorded data from single neurons during a perceptual experiment)
    * There are two stimulus conditions
    * And we are recording from 5 different neurons 

In [21]:
# seed the global random number generator so that we're all seeing the same thing in class
np.random.seed(0)

In [0]:
# index labels for our 5 neurons...see Series tutorial for more elegant ways of generating
# index labels, here we're just going to write them out
neuron_labels = ['Nrn0', 'Nrn1','Nrn2','Nrn3','Nrn4']  

In [0]:
# generate a response to stimulus 1...use random.randint just for practice/fun
min_resp = 0  # inclusive
max_resp = 45 # exclusive
resp1_hz = np.random.randint(min_resp, max_resp, len(neuron_labels))# generate response in each neuron to stimulus 1...

In [0]:
# generate a response to stimulus 2...use random.randint just for practice/fun
min_resp = 0  # inclusive
max_resp = 90 # exclusive
resp2_hz = np.random.randint(min_resp, max_resp, len(resp1_hz))

## New - use 'zip' function to wrap up the data from each list into one list
[reference page for zip](https://www.w3schools.com/python/ref_func_zip.asp)

* Operates just like it sounds - takes a set of iterables and groups them together into a single iterator, with the 1st element of the result made up of the first element of each input 'zipped' together, then the second element of each input zipped together, etc. 
* Length of the resulting iterator is limited by the length of the shortest input!

Because the length of the result is limited by the shortest input, zip will not raise an error if you zip together inputs with unequal lengths - this is fine if intentional, but if the unequal length was caused by a bug, then you may not notice it when using zip!
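
The silent truncation is easy to see with two short lists of unequal length. The lists below are made up for this sketch; `itertools.zip_longest` is one standard-library way to keep every element, and in Python 3.10 or later `zip(..., strict=True)` will raise an error on unequal lengths instead:

```python
from itertools import zip_longest

# hypothetical inputs: three labels but only two responses
labels = ['Nrn0', 'Nrn1', 'Nrn2']
rates = [36, 17]

# zip stops at the shortest input, so 'Nrn2' is silently dropped
truncated = list(zip(labels, rates))
print(truncated)

# zip_longest pads the shorter input instead of dropping data
padded = list(zip_longest(labels, rates, fillvalue=None))
print(padded)

# a simple guard if unequal lengths should never happen
lengths_match = len(labels) == len(rates)
print('lengths match?', lengths_match)
```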


In [25]:
neuron_data = list(zip(resp1_hz, resp2_hz))
print(neuron_data)

print('Grab one index to see the two response arrays zipped together:')
print(neuron_data[3])

[(36, 59), (17, 60), (6, 11), (8, 76), (9, 86)]
Grab one index to see the two response arrays zipped together:
(8, 76)


In [26]:
# note: to unzip go like this and you'll get back the original
uz_data1, uz_data2 = zip(*neuron_data)
print(uz_data1)

(36, 17, 6, 8, 9)


## Make a DataFrame object to hold the contents of the data set
[DataFrame help page](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.html)

* Just like with the pd.Series call, you specify the data, index labels (row labels in this case)
* In addition to row labels, you can also specify column labels (with 'columns')
* Can also specify data type (default is inferred)
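* Like pd.Series you can ask for an independent copy of the data (copy=True). If you leave copy unspecified, whether you get a view or a copy depends on the type of the input (e.g. a 2D ndarray is not copied by default, while a dict of arrays is)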

In [27]:
# make the call to pd.DataFrame to create the DF - usage much like pd.Series
df = pd.DataFrame(data = neuron_data, index=neuron_labels, columns = ['stim1', 'stim2'])

# take a look at the output...
display(df)   # compare to print(df) - looks nicer with display thanks to the IPython backend 

Unnamed: 0,stim1,stim2
Nrn0,36,59
Nrn1,17,60
Nrn2,6,11
Nrn3,8,76
Nrn4,9,86


In [28]:
# another handy display function...good for large dfs that are too big to fit - 
# at least you can get an idea of the overall structure
df.head()

Unnamed: 0,stim1,stim2
Nrn0,36,59
Nrn1,17,60
Nrn2,6,11
Nrn3,8,76
Nrn4,9,86


## Get a high-level summary of the data using built-in functionality of DataFrame object
[API reference page](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)

In [29]:
# first call this using the defaults
df.describe()

Unnamed: 0,stim1,stim2
count,5.0,5.0
mean,15.2,58.4
std,12.357184,28.814927
min,6.0,11.0
25%,8.0,59.0
50%,9.0,60.0
75%,17.0,76.0
max,36.0,86.0


In [48]:
# can specify different values to change behavior...
df.describe(percentiles=np.linspace(0,1,11))

Unnamed: 0,stim1,stim2
count,5.0,5.0
mean,15.2,58.4
std,12.357184,28.814927
min,6.0,11.0
0%,6.0,11.0
10%,6.8,30.2
20%,7.6,49.4
30.0%,8.2,59.2
40%,8.6,59.6
50%,9.0,60.0





Important bit of info for avoiding a common source of confusion (and potential bugs!!!)


* Note that if you make a DF out of a set of Series (e.g. a dict of Series), then the resulting DF index labels will be the union of the index labels in each Series
* This can be confusing because the DF will still be formed even if you have mismatching labels or even if you have two series of different sizes..
* Fortunately, the misaligned (or missing) values will be filled in with NaNs ('Not-a-Number') to serve as a placeholder for the misaligned or missing info
* Quick demo below before continuing on with our sample neuron data from above 


In [31]:
# make a set of two Series with unequal lengths stored in a dict, 
# with each Series having data and index labels

# Note that Series 1 has 4 elements, but Series 2 has 5 elements!

data_dict = {'dict0' : pd.Series(data = np.random.randn(4), index=['0','2','3','4']), 
            'dict1' : pd.Series(data = np.random.randint(0,5,5), index=['0','1','2','3','4'])}

# make a data frame
weird_df = pd.DataFrame(data_dict)

# take a look - notice that pd.DataFrame did not throw an error even though
# the input Series are different sizes...however, it did mark the missing value 
# with a NaN
display(weird_df)

Unnamed: 0,dict0,dict1
0,0.067396,2
1,,1
2,-0.978903,4
3,0.503986,2
4,-1.313963,1


Because of the above behavior, it is often good to check frequently for NaNs in your data to identify processing steps that might have gone awry (unless you are expecting NaNs as part of routine processing, in which case this might not be very helpful). 
There are many ways to do this, but here is one that works well for detecting a NaN anywhere in the DataFrame (which is handy because DataFrames are often too large to view in their entirety). 

In [32]:
# show True/False for each element in DF
# first do it the NumPy way
display(np.isnan(weird_df))

# use the 'any' method to figure out if any entries are NaN
# (note the parentheses: a bare .any is a bound method and is always truthy)
if np.isnan(weird_df).values.any():
    print('weird...you have NaNs in your data')

Unnamed: 0,dict0,dict1
0,False,False
1,True,False
2,False,False
3,False,False
4,False,False


weird...you have NaNs in your data


### Pandas also provides functions for dealing with NaNs (or missing values)

In [51]:
# return true for NaNs
pd.isna(weird_df)

Unnamed: 0,dict0,dict1
0,False,False
1,True,False
2,False,False
3,False,False
4,False,False


In [34]:
# Or return true for non NaNs
pd.notna(weird_df)  

Unnamed: 0,dict0,dict1
0,True,True
1,False,True
2,True,True
3,True,True
4,True,True


In [35]:
# apply to just one column at a time
# note that you can call the function from the object
# directly 
weird_df['dict0'].notna()

0     True
1    False
2     True
3     True
4     True
Name: dict0, dtype: bool
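
Building on these functions, the boolean output can be summarized rather than inspected by eye. This is a minimal sketch on a small made-up DataFrame (`demo_df` and its column names are not part of the data set above):

```python
import numpy as np
import pandas as pd

# hypothetical DataFrame with a single missing value in column 'a'
demo_df = pd.DataFrame({'a': [0.1, np.nan, 0.3], 'b': [1, 2, 3]})

# number of NaNs in each column
nan_per_column = demo_df.isna().sum()
print(nan_per_column)

# single True/False answer for the whole DataFrame
has_any_nan = bool(demo_df.isna().any().any())
print('any NaNs?', has_any_nan)

# pull out just the rows that contain at least one NaN
rows_with_nan = demo_df[demo_df.isna().any(axis=1)]
print(rows_with_nan)
```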

### Interpolate across missing values...
* Default is linear, but can do all sorts of other interpolations using scipy

In [53]:
interp_df = weird_df.interpolate()
display(weird_df)
display(interp_df)
# note that the missing value sits between two valid entries, so it is
# linearly interpolated by default (here: the midpoint of its two neighbors)
# if the missing value were the last entry, it would just take the last valid value

Unnamed: 0,dict0,dict1
0,0.067396,2
1,,1
2,-0.978903,4
3,0.503986,2
4,-1.313963,1


Unnamed: 0,dict0,dict1
0,0.067396,2
1,-0.455754,1
2,-0.978903,4
3,0.503986,2
4,-1.313963,1


### Can also fill NaNs lots of other ways...like assigning the mean of all or some of the data to NaNs

In [58]:
# e.g. fill the NaN with the mean of the first column!
#weird_df.fillna(weird_df.mean()['dict0'])
weird_df.fillna(weird_df['dict0'].mean())

Unnamed: 0,dict0,dict1
0,0.067396,2
1,-0.430371,1
2,-0.978903,4
3,0.503986,2
4,-1.313963,1


This is just a small subset of methods for finding and dealing with missing data. See this link for a more complete set of possibilities:

[Pandas missing data](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)

Be careful rooting out NaNs using equivalence testing - NaNs don't compare equal, not even to each other (because they are missing or unknown values). It is better to use boolean indexing based on the return vector from np.isnan or pd.isna, etc
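
A short sketch of why equivalence testing fails, using a made-up three-element Series:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, np.nan, 3.0])

# comparing against NaN never matches, even for the NaN entry itself
eq_mask = s_demo == np.nan
print(eq_mask)

# isna finds the missing entry, and ~ flips the mask to keep the valid ones
isna_mask = s_demo.isna()
clean = s_demo[~isna_mask]
print(clean)
```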

Remember that in most cases:
* When summing data, NaNs are skipped, which is equivalent to treating them as zero.
* Functions like cumsum() and cumprod() ignore NaNs by default when accumulating, but keep a NaN at each position where the input had one. 
    * To change the behaviour so that NaNs propagate, often use skipna=False or similar.
* Typically, the product of NaNs is NaN
    * But if you use pd.Series.prod, the NaNs are skipped, so the product of an all-NaN Series is 1 (the empty product)!

In [0]:
# make an array of nans
x = np.repeat(np.nan, 10)
print(x[0]*x[1])  # NaN

# convert x to a series
s = pd.Series(x)

# compute the prod of elements in series
pd.Series.prod(s)
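
The same rules can be checked on a tiny made-up Series. This sketch also uses the `min_count` argument of `prod`, which is one way to get NaN back when there are no valid values to multiply:

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1.0, np.nan, 2.0])

total_skip = s_nan.sum()                # NaN skipped: 1.0 + 2.0
total_strict = s_nan.sum(skipna=False)  # NaN propagates to the result
running = s_nan.cumsum()                # NaN kept in place, sum continues after it

all_nan = pd.Series([np.nan, np.nan])
all_nan_prod = all_nan.prod()                # nothing left to multiply -> 1.0
all_nan_strict = all_nan.prod(min_count=1)   # require at least one valid value -> NaN

print(total_skip, total_strict)
print(running)
print(all_nan_prod, all_nan_strict)
```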

## Indexing, adding, deleting entire columns
* Think of the DF as a dictionary of Series objects with common labels - much of the syntax is the same as for dicts (and for Series)

In [0]:
# grab the second column from our DF
display(df['stim2'])

### Adding a column is easy and can be done dynamically (on the fly)

In [0]:
# define a third response column as the product of the first two columns
df['stim3'] = df.stim1 * df.stim2
display(df)

### Removing columns is also easy and done on the fly... 

In [0]:
# using the del command will delete a column from the DF
# note that here you have to use the df['stim3'] notation
# the df.stim3 notation will not work.
del df['stim3']
print(df)

### Instead of deleting outright, you can also "pop" a column out and assign it to another variable

In [0]:
# define a third response column as the product of the first two columns
df['stim3'] = df.stim1 * df.stim2

# then pop it out
stim3 = df.pop('stim3')
display(stim3)

# now df is back down to just 2 columns
print('\n')
display(df)
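
Both `del` and `pop` modify the DF in place. If you would rather leave the original untouched, `drop` returns a new DF instead. A minimal sketch on a small made-up DF (not the neuron data above):

```python
import pandas as pd

# hypothetical two-row DF
demo = pd.DataFrame({'stim1': [1, 2], 'stim2': [3, 4]})
demo['stim3'] = demo['stim1'] * demo['stim2']

# drop returns a new DF without the column; demo itself is unchanged
trimmed = demo.drop(columns=['stim3'])

print(demo.columns.tolist())
print(trimmed.columns.tolist())
```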

## More on indexing and selection of specific coordinates in a DF

### Row selection - this is a bit more confusing as there are many methods
* You can use df.loc to select a row by its label name
* You can use df.iloc to select a row by its integer location 
* You can use boolean vectors to select a set of rows that satisfy some condition
* You can slice rows using standard notation e.g. df[1:3] for the rows at integer positions 1 and 2 (the stop index is excluded)
* You can also isolate a particular row/column using a combo of column indexing (see above) and standard slicing notation

Contrary to usual slicing conventions, both the start and the stop labels are included when slicing with df.loc...see below for demo. This makes sense because you're indexing by label name, not by a zero-based integer index. 


In [0]:
# data from 2nd neuron across both stimulus conditions
df.loc['Nrn1']

In [0]:
# CAREFUL!
# data from 2nd-4th neuron inclusive across both stimulus conditions
# (the stop label 'Nrn3' is included in the result)
df.loc['Nrn1':'Nrn3']

# again, just need to be careful but this makes sense given that you're indexing based on label name (not 0-based counting)

In [0]:
# data from 5th neuron across both stimulus conditions
df.iloc[4]

In [0]:
# data from 3rd-5th neuron across both stimulus conditions (positions 2, 3 and 4)
df[2:5]
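
Row selection with boolean vectors is listed above but not demonstrated, so here is a small sketch. It rebuilds a DF by hand from the values printed earlier (the name `demo` and the thresholds are arbitrary choices for illustration):

```python
import pandas as pd

# values copied from the displayed neuron DF above
demo = pd.DataFrame({'stim1': [36, 17, 6, 8, 9],
                     'stim2': [59, 60, 11, 76, 86]},
                    index=['Nrn0', 'Nrn1', 'Nrn2', 'Nrn3', 'Nrn4'])

# rows where the response to stimulus 2 exceeds 50 Hz
strong = demo[demo['stim2'] > 50]
print(strong)

# combine conditions with & (and) or | (or); each condition needs its own parentheses
both = demo[(demo['stim1'] < 10) & (demo['stim2'] > 50)]
print(both)
```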

In [0]:
# can use the trick for returning only a subset of values from a function that we discussed in
# the randomization/bootstrapping lecture:
# here grab the 2nd entry from the 2nd column
# (.iloc makes it explicit that 1 is an integer position, not a label)
print('2nd column, 2nd entry')
print(df['stim2'].iloc[1])

# can also go like this
print('2nd column, 4th entry')
print(df.stim2.iloc[3])
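
A single cell can also be selected in one step by giving both the row and the column to `loc` or `iloc`. The sketch below rebuilds a DF by hand from the values printed earlier (the name `demo` is arbitrary) and also contrasts the inclusive label slice with the exclusive position slice:

```python
import pandas as pd

# values copied from the displayed neuron DF above
demo = pd.DataFrame({'stim1': [36, 17, 6, 8, 9],
                     'stim2': [59, 60, 11, 76, 86]},
                    index=['Nrn0', 'Nrn1', 'Nrn2', 'Nrn3', 'Nrn4'])

by_label = demo.loc['Nrn3', 'stim2']     # row label, column label
by_position = demo.iloc[3, 1]            # row position, column position
scalar_access = demo.at['Nrn3', 'stim2'] # .at is meant for single values

label_slice = demo.loc['Nrn1':'Nrn3']    # stop label included -> 3 rows
position_slice = demo.iloc[1:3]          # stop position excluded -> 2 rows

print(by_label, by_position, scalar_access)
print(len(label_slice), len(position_slice))
```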

## Making cooler DataFrame styles (and more useful...although that should take a backseat to coolness)
[Check here for a bunch of neat style options](https://pandas.pydata.org/pandas-docs/stable/style.html)
* Simple demo - can write custom functions that highlight specific aspects of your data - can be very useful for more clearly highlighting/communicating key points in the data within a notebook  

In [0]:
# highlight the max value in each column in yellow...
# (df.style.apply passes one column at a time, as a Series, by default)
def highlight_max_value_in_columns(column):
    is_max = column == column.max()
    return ['background-color: yellow' if m else '' for m in is_max]


In [0]:
# apply it here!
df.style.apply(highlight_max_value_in_columns)