# Scientific data in Python



## Time series

A time series, from the computer's perspective, is essentially a list of numbers. There are many ways you can "contain" a list of numbers in Python. The simplest is called a


### *list*:

In [1]:
a = [0, .01, .04, .09, .16, .25]
b = [(x/10)**2 for x in range(6)]

In [4]:
a

[0, 0.01, 0.04, 0.09, 0.16, 0.25]

In [3]:
type(a)

list

In [5]:
a[2]

0.04

In [6]:
b

[0.0,
 0.010000000000000002,
 0.04000000000000001,
 0.09,
 0.16000000000000003,
 0.25]

One major advantage of lists is that you can add contents to them:

In [7]:
a.append(0.36)
b += [(x/10)**2 for x in range(6, 10)]

In [8]:
a

[0, 0.01, 0.04, 0.09, 0.16, 0.25, 0.36]

In [9]:
b

[0.0,
 0.010000000000000002,
 0.04000000000000001,
 0.09,
 0.16000000000000003,
 0.25,
 0.36,
 0.48999999999999994,
 0.6400000000000001,
 0.81]

A disadvantage is that you cannot do math with them. In particular

In [10]:
a + b

[0,
 0.01,
 0.04,
 0.09,
 0.16,
 0.25,
 0.36,
 0.0,
 0.010000000000000002,
 0.04000000000000001,
 0.09,
 0.16000000000000003,
 0.25,
 0.36,
 0.48999999999999994,
 0.6400000000000001,
 0.81]

In [11]:
type(a)

list

does not do what you might think: the + operator concatenates the two lists rather than adding them element by element. (Try it!)

### Mini-exercise: Math on lists
Write a little code snippet that takes the numbers from the list *a*, multiplies them by 10, and stores the result in a new list *x*.

In [12]:
# Insert your code here
x = []
for i in range(len(a)):
  x.append(10*a[i])

In [13]:
x

[0, 0.1, 0.4, 0.8999999999999999, 1.6, 2.5, 3.5999999999999996]

In [14]:
x = []
for v in a:
  x.append(10*v)

In [15]:
x

[0, 0.1, 0.4, 0.8999999999999999, 1.6, 2.5, 3.5999999999999996]

In [16]:
x = [ 10*v for v in a]

In [17]:
x

[0, 0.1, 0.4, 0.8999999999999999, 1.6, 2.5, 3.5999999999999996]

In [18]:
def tentimes(a):
  '''TENTIMES - Takes a list of numbers, and produces a new list of numbers, each
  of which is 10 times the value of the corresponding number in the input list.'''
  return [10*v for v in a]

In [19]:
tentimes

<function __main__.tentimes(a)>

In [21]:
tentimes(b)

[0.0,
 0.10000000000000002,
 0.4000000000000001,
 0.8999999999999999,
 1.6000000000000003,
 2.5,
 3.5999999999999996,
 4.8999999999999995,
 6.400000000000001,
 8.100000000000001]

In [22]:
c = [ "Cat", True, [0,1,2,3]]

In [23]:
c

['Cat', True, [0, 1, 2, 3]]

Because math is so important in data analysis, it is attractive to contain sequences of numbers in Python in such a way that math is possible. One such way is the
### *numpy array*:

In [24]:
import numpy as np
y = np.array([0, .01, .04, .09, .16, .25])
z = np.arange(0, .6, .1)**2
w = y + z

In [25]:
y

array([0.  , 0.01, 0.04, 0.09, 0.16, 0.25])

In [26]:
type(y)

numpy.ndarray

In [27]:
type(z)

numpy.ndarray

In [28]:
z

array([0.  , 0.01, 0.04, 0.09, 0.16, 0.25])

In [29]:
w

array([0.  , 0.02, 0.08, 0.18, 0.32, 0.5 ])
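
To make the contrast with lists concrete, here is a minimal sketch that does the same addition both ways. The names *a_list*, *b_list*, *a_arr*, and *b_arr* are made up for this illustration and do not appear elsewhere in this notebook.

```python
import numpy as np

a_list = [0, 0.01, 0.04]
b_list = [0.0, 0.01, 0.04]

# With lists, "+" glues the two lists together (6 items).
concatenated = a_list + b_list
# Element-by-element addition on lists needs an explicit loop or comprehension.
elementwise_list = [p + q for p, q in zip(a_list, b_list)]

a_arr = np.array(a_list)
b_arr = np.array(b_list)
# With arrays, "+" adds item by item (3 items), and scalars apply to every item.
elementwise_arr = a_arr + b_arr
scaled = 10 * a_arr

print(len(concatenated), len(elementwise_arr))
```

The price you pay is that a numpy array has a fixed size: there is no cheap equivalent of *append*.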

In [None]:
# [CxBxTxR] -> [CxBxT] x R -> [CxB] x [TxR] -> [CxR] -> [BxT]

## Several time series
In the above example, we had just one time series. What if we have multiple time series, e.g., the temperature record for several cities? Again, we have many choices for how to contain those in Python. The simplest is:


### *dict of lists*:

In [30]:
temps = {
    "LA": [83, 85, 87, 86],
    "San Francisco": [69, 67, 63, 61],
    "Fargo": [43, 47, 84, 45]
}

In [31]:
type(temps)

dict

In [32]:
temps

{'LA': [83, 85, 87, 86],
 'San Francisco': [69, 67, 63, 61],
 'Fargo': [43, 47, 84, 45]}

In [34]:
temps['Fargo'] - temps['LA']

TypeError: unsupported operand type(s) for -: 'list' and 'list'

(A *dict*, short for "dictionary", is what Python calls a bag of labeled items.)
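
With plain lists, the subtraction has to be spelled out item by item. A minimal sketch, using a hypothetical two-city dict called *temps_lists* so that it runs on its own:

```python
# Hypothetical two-city subset of the temperature data, as a dict of lists.
temps_lists = {
    "LA": [83, 85, 87, 86],
    "Fargo": [43, 47, 84, 45],
}

# Pair up the days with zip() and subtract one pair at a time.
diff = [la - fargo for la, fargo in zip(temps_lists["LA"], temps_lists["Fargo"])]
print(diff)
```

This works, but every new calculation needs its own loop.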

Those time series are nicely labeled, but we cannot easily do math. Slightly nicer perhaps is a
### *dict of arrays*:

In [35]:
temps2 = {
    "LA": np.array([83, 85, 87, 86]),
    "San Francisco": np.array([69, 67, 63, 61]),
    "Fargo": np.array([43, 47, 84, 45])
        }

### Mini-exercise: Math on dicts of arrays
What was the temperature difference between LA and Fargo on each of these four days?

In [None]:
# Insert your code here
temps2["LA"] - temps2["Fargo"]

Of course, the full power of Python is available, so you could have written:

In [38]:
temps2 = { k: np.array(v) for k, v in temps.items() }
temps2

{'LA': array([83, 85, 87, 86]),
 'San Francisco': array([69, 67, 63, 61]),
 'Fargo': array([43, 47, 84, 45])}

If that looks like magic, don't worry about it right now, but be sure to find some spare time to educate yourself on Python "comprehensions", for instance here: https://www.geeksforgeeks.org/comprehensions-in-python. It's very useful magic.
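
As a small taste, here is a sketch of a list comprehension, a filtered list comprehension, and a dict comprehension. The variable names and the tiny data set are made up for illustration.

```python
# List comprehension: build a list from a loop in one line.
squares = [n**2 for n in range(5)]

# An optional "if" clause filters the items.
evens = [n for n in range(10) if n % 2 == 0]

# Dict comprehension: same idea, producing key: value pairs.
temps_small = {"LA": [83, 85], "Fargo": [43, 47]}
means = {city: sum(vals) / len(vals) for city, vals in temps_small.items()}

print(squares, evens, means)
```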

If all of the items in your dict are arrays of the same length, it is often attractive to instead put them all in a larger array, and store the labels elsewhere:
### *externally labeled numpy array*:

In [48]:
cities = ["LA", "San Francisco", "Fargo"]
temps = np.array([
    [83, 85, 87, 86],
    [69, 67, 63, 61],
    [43, 47, 84, 45]
])

### Mini-exercise:
Which city had the largest change in temperature?

In [44]:
# Insert your code here
# Dict style
for city, temp in temps2.items():
    print(city, temp)

LA [83 85 87 86]
San Francisco [69 67 63 61]
Fargo [43 47 84 45]


In [45]:
# Insert your code here
# Dict style
for city, temp in temps2.items():
    print(city, np.diff(temp))

LA [ 2  2 -1]
San Francisco [-2 -4 -2]
Fargo [  4  37 -39]


In [47]:
# Insert your code here
# Dict style
for city, temp in temps2.items():
    print(city, np.max(np.abs(np.diff(temp))))

LA 2
San Francisco 4
Fargo 39


And now for the externally labeled array style:

In [49]:
cities

['LA', 'San Francisco', 'Fargo']

In [51]:
temps

array([[83, 85, 87, 86],
       [69, 67, 63, 61],
       [43, 47, 84, 45]])

In [53]:
np.diff(temps, 0)

array([[83, 85, 87, 86],
       [69, 67, 63, 61],
       [43, 47, 84, 45]])

In [54]:
np.diff?

In [55]:
np.diff(temps, 2)

array([[  0,  -3],
       [ -2,   2],
       [ 33, -76]])

In [56]:
np.diff(np.diff(temps))

array([[  0,  -3],
       [ -2,   2],
       [ 33, -76]])

In [60]:
np.max(np.abs(np.diff(temps, axis=1)))

39

In [61]:
np.max?

In [69]:
maxdt = np.max(np.abs(np.diff(temps, axis=1)), axis=1)

In [70]:
maxdt

array([ 2,  4, 39])

In [65]:
for city, dt in zip(cities, maxdt):
  print(city, dt)

LA 2
San Francisco 4
Fargo 39


In [71]:
for i in range(len(cities)):
  print(cities[i], maxdt[i])

LA 2
San Francisco 4
Fargo 39


In [67]:
maxdt = np.max(np.abs(np.diff(temps, axis=1)), axis=0) # Really bad! Do not reuse variable names

array([ 4, 37, 39])
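
To answer the exercise in one go, one option is to let *np.argmax* pick the row with the largest change and use it to look up the label. This is only a sketch; the name *temps_arr* and the shape check are additions for this illustration.

```python
import numpy as np

cities = ["LA", "San Francisco", "Fargo"]
temps_arr = np.array([
    [83, 85, 87, 86],
    [69, 67, 63, 61],
    [43, 47, 84, 45],
])

# Cheap sanity check: one row per label.
assert temps_arr.shape[0] == len(cities)

# Largest day-to-day change per city: diff along days (axis=1),
# then max along days again, leaving one number per city.
maxdt = np.max(np.abs(np.diff(temps_arr, axis=1)), axis=1)

# argmax gives the row index of the winner; the label comes from the list.
winner = cities[int(np.argmax(maxdt))]
print(winner, maxdt.max())
```

Note that nothing but your own discipline keeps the rows of *temps_arr* and the entries of *cities* in the same order.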



The problem with external labels is that it is far too easy to make a mistake and lose track of what is what. If you have labeled data, it is often more attractive to store them in a


### *pandas dataframe*:

In [72]:
import pandas as pd
temps = pd.DataFrame({
    "LA": [83, 85, 87, 86],
    "San Francisco": [69, 67, 63, 61],
    "Fargo": [43, 47, 84, 45]
})
temps

   LA  San Francisco  Fargo
0  83             69     43
1  85             67     47
2  87             63     84
3  86             61     45


In [73]:
temps["LA"]

0    83
1    85
2    87
3    86
Name: LA, dtype: int64
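
For comparison with the earlier mini-exercise, here is a sketch of the "largest change in temperature" question in dataframe style. The labels now travel with the data, so there is no separate list to keep in sync. The name *temps_df* is used only in this illustration.

```python
import pandas as pd

temps_df = pd.DataFrame({
    "LA": [83, 85, 87, 86],
    "San Francisco": [69, 67, 63, 61],
    "Fargo": [43, 47, 84, 45],
})

# diff() works down each column (day to day); the first row becomes NaN
# and is skipped by max().
maxdt = temps_df.diff().abs().max()

# idxmax() returns the label (not the position) of the largest entry.
winner = maxdt.idxmax()
print(maxdt)
print(winner)
```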

This is not the place to spend a lot of time on all the ways you can interact with data in dataframes. If you haven't already, I suggest taking Justin Bois's course.



In [75]:
temps2 = pd.DataFrame({"City": ["LA", "LA", "LA", "SF", "SF", "SF"],
                       "Day": ["Monday", "Tuesday", "Wednesday", "Monday", "Tuesday", "Wednesday"],
                       "Temp": [83, 84, 87, 67, 69, 65]})
temps2

   City        Day  Temp
0    LA     Monday    83
1    LA    Tuesday    84
2    LA  Wednesday    87
3    SF     Monday    67
4    SF    Tuesday    69
5    SF  Wednesday    65


In [78]:
Tmonday = temps2[temps2["Day"]=="Monday"]
T0monday = Tmonday["Temp"].mean()

In [79]:
T0monday

75.0
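
If you want the mean for every day (or every city) rather than just Monday, *groupby* does the selection for you. A minimal sketch; *temps_long* is a name chosen for this illustration.

```python
import pandas as pd

temps_long = pd.DataFrame({
    "City": ["LA", "LA", "LA", "SF", "SF", "SF"],
    "Day": ["Monday", "Tuesday", "Wednesday", "Monday", "Tuesday", "Wednesday"],
    "Temp": [83, 84, 87, 67, 69, 65],
})

# One mean per day, averaged over cities.
mean_by_day = temps_long.groupby("Day")["Temp"].mean()

# One mean per city, averaged over days.
mean_by_city = temps_long.groupby("City")["Temp"].mean()

print(mean_by_day)
print(mean_by_city)
```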

## Multi-dimensional time series

We have considered individual timeseries, and loosely related collections of timeseries. What if we had a timeseries where the "values" are not just numbers, but, for instance, images? Just as we created a "CxD" (Cities-by-Days) array of temperatures, you can create an "XxYxT" (X-by-Y-by-Time) array of pixel values in a movie.

Very often, today, we will work with enormous TxE (Time-by-Electrode) arrays of digitized voltages from the 384 electrodes of a Neuropixels probe.

### Mini-exercise: Data ordering
Does it matter whether you organize your data as "XxYxT" vs "TxYxX"? Can you convert between the different organizations?



In [None]:
# Insert your code here
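
One possible answer, as a sketch: numpy can reorder axes with *transpose* or *np.moveaxis*. The tiny "movie" below is made up, and the variable names are chosen only for this illustration.

```python
import numpy as np

# Hypothetical tiny "movie": X=2, Y=3, T=4.
movie_xyt = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Reverse the axis order: XxYxT -> TxYxX.
movie_tyx = movie_xyt.transpose(2, 1, 0)
# Or move only the time axis to the front: XxYxT -> TxXxY.
movie_txy = np.moveaxis(movie_xyt, -1, 0)
print(movie_tyx.shape, movie_txy.shape)

# The same pixel is reached with the indices in the new order.
same_pixel = movie_xyt[1, 2, 3] == movie_tyx[3, 2, 1]

# transpose() returns a view on the same memory, not a copy.
is_view = np.shares_memory(movie_xyt, movie_tyx)

# Make a copy if you need the data laid out contiguously in the new order.
movie_tyx_copy = np.ascontiguousarray(movie_tyx)
```

The values are the same either way. What the ordering changes is which index comes first in your code and how the data are laid out in memory, which can affect speed for large arrays.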


## A note on data types
We have talked a lot about "container" types, i.e., lists vs arrays vs dataframes. The stuff we put inside those containers was always numbers. From our perspective, that's good enough, but the computer cares a lot about whether those numbers are integers, arbitrary real numbers ("floats"), or complex numbers. You can often ignore these distinctions, but every once in a while you get bitten. For instance:

In [80]:
a = np.arange(0, 24)
b = 7 * 2**a
c = b**3
c

array([                 343,                 2744,                21952,
                     175616,              1404928,             11239424,
                   89915392,            719323136,           5754585088,
                46036680704,         368293445632,        2946347565056,
             23570780520448,      188566244163584,     1508529953308672,
          12068239626469376,    96545917011755008,   772367336094040064,
        6178938688752320512, -5908722711110090752,  8070450532247928832,
       -9223372036854775808,                    0,                    0])

In [81]:
a = np.arange(0, 24)
b = 7.0 * 2**a
c = b**3
c

array([3.43000000e+02, 2.74400000e+03, 2.19520000e+04, 1.75616000e+05,
       1.40492800e+06, 1.12394240e+07, 8.99153920e+07, 7.19323136e+08,
       5.75458509e+09, 4.60366807e+10, 3.68293446e+11, 2.94634757e+12,
       2.35707805e+13, 1.88566244e+14, 1.50852995e+15, 1.20682396e+16,
       9.65459170e+16, 7.72367336e+17, 6.17893869e+18, 4.94315095e+19,
       3.95452076e+20, 3.16361661e+21, 2.53089329e+22, 2.02471463e+23])

The nonsense in the first result (negative numbers and zeros) arises because numpy stores *a*, *b*, and *c* as integers, which, on most computers, cannot hold values of 2<sup>63</sup> (about 10<sup>19</sup>) or larger; results beyond that limit silently wrap around. In the second version, multiplying by 7.0 turns *b* into floats, which have a far larger range. If 10<sup>19</sup> sounds like a big number, imagine you have a time series of 100,000,000 samples (that's just under an hour of recording at a typical 30 kHz sampling rate), and your algorithm needs the third power of time in its calculation. Here is what happens to some of those numbers:

In [82]:
tt = 1001*1002*13*np.arange(10)
tt3 = tt**3
print(tt)
print(tt3)

[        0  13039026  26078052  39117078  52156104  65195130  78234156
  91273182 104312208 117351234]
[                   0  3236350710934915656  7444061613769773632
 -4852251173305035368  4212260689029534208 -1284530754745678552
 -1924521239021179712  3263649428102973048 -3195402635182829568
 -1883573163269093624]


Bottom line: If you see unexpected negative numbers or zeros where you expect big positive numbers, worry about data types.
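
If you suspect this kind of overflow, a sketch like the following can confirm it. It compares the wrapped int64 result against floats and against exact Python integers (which never overflow). The names *int64_max*, *wrapped*, *as_float*, *exact*, and *too_big* are chosen for this illustration.

```python
import numpy as np

# Same sample times as above, stored explicitly as 64-bit integers.
tt = 1001 * 1002 * 13 * np.arange(10, dtype=np.int64)
int64_max = np.iinfo(np.int64).max     # largest value an int64 can hold

wrapped = tt**3                        # silently wraps around on overflow
as_float = tt.astype(np.float64)**3    # approximate, but the right magnitude
exact = [int(t)**3 for t in tt]        # Python ints never overflow

# Flag the entries whose true cube does not fit in an int64.
too_big = [value > int64_max for value in exact]
print(too_big)
```

Converting to float before the calculation is usually the practical fix; the exact Python integers are shown here only as a reference.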


### Mini-exercise: Contained data types
1. How do you find out whether Python thinks a number is an integer or a float when it is stored inside a (a) list, (b) numpy array, (c) pandas dataframe?
2. How can you convert one contained data type to another?

In [91]:
# Insert your code here
lst = list(a)
print(type(a), type(lst), type(temps2))

<class 'numpy.ndarray'> <class 'list'> <class 'pandas.core.frame.DataFrame'>


In [92]:
a.dtype

dtype('int64')

In [93]:
type(a[0])

numpy.int64

In [95]:
lst.dtype

AttributeError: 'list' object has no attribute 'dtype'

In [96]:
type(lst[0])

numpy.int64

In [99]:
temps2["Temp"].dtype

dtype('int64')
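
For the second question, here is a sketch of some common conversions. The variable names and the small dataframe are made up for this illustration.

```python
import numpy as np
import pandas as pd

ints = np.arange(4)

# numpy array: astype() returns a new array with the requested type.
floats = ints.astype(np.float64)

# numpy array -> list of plain Python numbers (unlike list(ints),
# which keeps the numpy scalar type of each item).
as_python = ints.tolist()

# list: convert item by item.
as_python_floats = [float(v) for v in as_python]

# pandas: astype() works on a column (or on a whole dataframe).
df_t = pd.DataFrame({"Temp": [83, 84, 87]})
df_t["Temp"] = df_t["Temp"].astype("float64")

print(floats.dtype, type(as_python[0]), type(as_python_floats[0]), df_t["Temp"].dtype)
```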

## Point processes

From the computer's perspective, a point process is also just a list of numbers. The computer doesn't necessarily know that these numbers represent time stamps. As such, they can be stored in lists or arrays, just like time series:

In [100]:
a = [12.02, 13.04, 14.01, 14.98, 16.07]
b = np.array(a)

Very often, a point process may have additional data associated with each recorded time point. For instance, we could have a record of earthquakes, with times and magnitudes, or a record of action potentials, with times and neuron of origin:

In [101]:
tms_s = [1.345, 2.42, 3.34, 3.48, 5.43]
celid =   [3,     2,    45,   2,    17  ]

(I always encode the unit of a physical quantity in the name of the variable. And I abbreviate words like "time" and "cell", if for no other reason than that I have a hard time remembering which English words are obscure Python keywords, or the names of important functions.)

Storing such a point process in a data frame is often attractive:

In [102]:
df = pd.DataFrame({"Time_s": tms_s, "CellID": celid})
df

   Time_s  CellID
0   1.345       3
1   2.420       2
2   3.340      45
3   3.480       2
4   5.430      17
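
One reason the dataframe is attractive is that selecting events by their associated data is short. A sketch, rebuilding the same table so that it runs on its own; *df_spk*, *cell2_s*, and *isi_s* are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

df_spk = pd.DataFrame({
    "Time_s": [1.345, 2.42, 3.34, 3.48, 5.43],
    "CellID": [3, 2, 45, 2, 17],
})

# Times of all events that belong to cell 2.
cell2_s = df_spk.loc[df_spk["CellID"] == 2, "Time_s"].to_numpy()

# Intervals between successive events, regardless of cell.
isi_s = np.diff(df_spk["Time_s"].to_numpy())

print(cell2_s)
print(isi_s)
```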


## Multiple point processes

In extracellular physiology, you very commonly encounter simultaneous recordings from *C* different cells, comprising the times at which each cell fired. We already saw that you can store such data in a dataframe, with "time" and "cell ID" as column headings. If you care about individual cells, you could also store the data in a "dict of arrays":

In [103]:
spkt_s = {
    2: np.array([2.42, 3.48]),
    3: np.array([1.345]),
    17: np.array([5.43]),
    45: np.array([3.34])
}
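
If the data already live in a dataframe, one way to obtain such a dict of arrays is to group by cell ID. This is a sketch rather than the only way to do it; *df_spk* and *spkt_from_df_s* are names chosen for this illustration.

```python
import pandas as pd

df_spk = pd.DataFrame({
    "Time_s": [1.345, 2.42, 3.34, 3.48, 5.43],
    "CellID": [3, 2, 45, 2, 17],
})

# groupby() yields (cell ID, sub-dataframe) pairs; keep only the times.
spkt_from_df_s = {
    cell: grp["Time_s"].to_numpy() for cell, grp in df_spk.groupby("CellID")
}

for cell, times in spkt_from_df_s.items():
    print(cell, times)
```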

### Thought exercise

Could you store this in a single numpy array, where the columns are cells and the rows are ...?

(Of course in practice, such data tend to derive from some algorithm that extracts spike times and putative source identity from raw voltage data, so you would not construct this kind of structure manually in this fashion. We'll see much more of that later today.)
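
One possible answer to the thought exercise, as a sketch: let the rows be time bins and store spike counts. The bin width of 1 s and the 0 to 6 s range are arbitrary assumptions made for this illustration, as are the names *bin_s*, *edges_s*, *cells*, and *counts*.

```python
import numpy as np

spkt_s = {
    2: np.array([2.42, 3.48]),
    3: np.array([1.345]),
    17: np.array([5.43]),
    45: np.array([3.34]),
}

bin_s = 1.0                      # assumed bin width, in seconds
edges_s = np.arange(0, 7) * bin_s  # bin edges at 0, 1, ..., 6 s
cells = sorted(spkt_s)           # fixed column order: 2, 3, 17, 45

# One histogram per cell, stacked as columns: rows are time bins.
counts = np.column_stack(
    [np.histogram(spkt_s[c], bins=edges_s)[0] for c in cells]
)
print(counts.shape)
print(counts)
```

Note that this is again an externally labeled array: *cells* tells you which column is which, and the exact spike times within each bin are lost.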