# Scientific data in Python



## Time series

A time series, from the computer's perspective, is essentially a list of numbers. There are many ways you can "contain" a list of numbers in Python. The simplest is called a


### *list*:

In [None]:
a = [0, .01, .04, .09, .16, .25]
b = [(x/10)**2 for x in range(6)]

One major advantage of lists is that you can add contents to them:

In [None]:
a.append(0.36)
b += [(x/10)**2 for x in range(6, 10)]

A disadvantage is that you cannot do math with them. In particular

In [None]:
a + b

does not do what you might think. (Try it!)
### Mini-exercise: Math on lists
Write a little code snippet that takes the numbers from the list *a*, multiplies them by 10, and stores the result in a new list *x*.

In [None]:
# Insert your code here


Because math is so important in data analysis, it is attractive to contain sequences of numbers in Python in such a way that math is possible. One such way is the
### *numpy array*:

In [None]:
import numpy as np
y = np.array([0, .01, .04, .09, .16, .25])
z = np.arange(0, .6, .1)**2
w = y + z
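
To make the contrast with lists concrete, here is a minimal sketch of element-wise arithmetic on arrays. The names `y10` and `longer` are made up for illustration. Note the trade-off: an array has a fixed size, so `np.append` builds a new array rather than growing the existing one.

```python
import numpy as np

y = np.array([0, .01, .04, .09, .16, .25])
z = (np.arange(6) / 10) ** 2   # the same values, computed rather than typed

w = y + z       # element-wise sum: one result per sample
y10 = 10 * y    # multiplying by a number scales every sample

# Arrays cannot grow in place; np.append returns a new, longer array
longer = np.append(y, 0.36)
print(w.shape, y10.shape, longer.shape)
```
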

## Several time series
In the above example, we had just one time series. What if we have multiple time series, e.g., the temperature record for several cities? Again, we have many choices for how to contain those in Python. The simplest is:


### *dict of lists*:

In [None]:
temps = {
    "LA": [83, 85, 87, 86],
    "San Francisco": [69, 67, 63, 61],
    "Fargo": [43, 47, 84, 45]
}

(A *dict*, short for "dictionary", is what Python calls a bag of labeled items.)

Those time series are nicely labeled, but we cannot easily do math. Slightly nicer perhaps is a
### *dict of arrays*:

In [None]:
temps2 = {
    "LA": np.array([83, 85, 87, 86]),
    "San Francisco": np.array([69, 67, 63, 61]),
    "Fargo": np.array([43, 47, 84, 45])
}

### Mini-exercise: Math on dicts of arrays
What was the temperature difference between LA and Fargo on each of these four days?

In [None]:
# Insert your code here


Of course, the full power of Python is available, so you could have written:

In [None]:
temps2 = { k: np.array(v) for k, v in temps.items() }
temps2

If that looks like magic, don't worry about it right now, but be sure to find some spare time to educate yourself on Python "comprehensions", for instance here: https://www.geeksforgeeks.org/comprehensions-in-python. It's very useful magic.
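
As a small taste, here is a sketch of the two most common kinds of comprehension. The variable names (`squares`, `even_squares`, `temps_f`, `warmest_f`) are made up for this illustration.

```python
# List comprehension: build a new list from an existing sequence
squares = [n**2 for n in range(5)]

# An optional "if" clause filters the items
even_squares = [n**2 for n in range(5) if n % 2 == 0]

# Dict comprehension: the same idea, producing key: value pairs
temps_f = {"LA": [83, 85, 87, 86], "Fargo": [43, 47, 84, 45]}
warmest_f = {city: max(vals) for city, vals in temps_f.items()}

print(squares)        # [0, 1, 4, 9, 16]
print(even_squares)   # [0, 4, 16]
print(warmest_f)      # {'LA': 87, 'Fargo': 84}
```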

If all of the items in your dict are arrays of the same length, it is often attractive to instead put them all in a larger array, and store the labels elsewhere:
### *externally labeled numpy array*:

In [None]:
cities = ["LA", "San Francisco", "Fargo"]
temps = np.array([
    [83, 85, 87, 86],
    [69, 67, 63, 61],
    [43, 47, 84, 45]
])
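
With this layout, the label list tells you which row belongs to which city. The following is a minimal sketch of how the two pieces work together; the names `row`, `fargo`, `day0`, and `mean_per_city` are introduced here only for illustration.

```python
import numpy as np

cities = ["LA", "San Francisco", "Fargo"]
temps = np.array([
    [83, 85, 87, 86],
    [69, 67, 63, 61],
    [43, 47, 84, 45]
])

print(temps.shape)                   # (number of cities, number of days)

row = cities.index("Fargo")          # look up the row number that goes with a label
fargo = temps[row]                   # all four days for that city
day0 = temps[:, 0]                   # the first day for every city
mean_per_city = temps.mean(axis=1)   # average over days (axis 1)
print(mean_per_city)
```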

### Mini-exercise: Largest temperature change
Which city had the largest change in temperature?

In [None]:
# Insert your code here




The problem with external labels is that it is far too easy to make a mistake and lose track of what is what. If you have labeled data, it is often more attractive to store them in a


### *pandas dataframe*:

In [None]:
import pandas as pd
temps = pd.DataFrame({
    "LA": [83, 85, 87, 86],
    "San Francisco": [69, 67, 63, 61],
    "Fargo": [43, 47, 84, 45]
})
temps
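
Just to give a flavor of why the labels are convenient, here is a brief sketch of a few basic operations. The names `la`, `day2`, and `means` are made up for this example.

```python
import pandas as pd

temps = pd.DataFrame({
    "LA": [83, 85, 87, 86],
    "San Francisco": [69, 67, 63, 61],
    "Fargo": [43, 47, 84, 45]
})

la = temps["LA"]      # one column, selected by its label
day2 = temps.loc[2]   # one row (the third day), with values labeled by city
means = temps.mean()  # column-wise mean: one value per city
print(means)
```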

This is not the place to spend a lot of time on all the ways you can interact with data in dataframes. If you haven't already, I suggest taking Justin Bois's course.



## Multi-dimensional time series

We have considered individual time series, and loosely related collections of time series. What if we had a time series where the "values" are not just numbers, but, for instance, images? Just as we created a "CxD" (Cities-by-Days) array of temperatures, you can create an "XxYxT" (X-by-Y-by-Time) array of pixel values in a movie.

Very often, today, we will work with enormous TxE (Time-by-Electrode) arrays of digitized voltages from the 384 electrodes of a Neuropixels probe.
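
Here is a rough sketch of what such arrays look like in numpy. All sizes other than the 384 electrodes are hypothetical, the arrays are filled with zeros rather than real data, and the 16-bit integer type for the voltages is an assumption made for illustration.

```python
import numpy as np

# Hypothetical movie size, chosen only for illustration
nx, ny, nt = 64, 48, 100
movie = np.zeros((nx, ny, nt))   # X-by-Y-by-Time
frame = movie[:, :, 10]          # one image: all pixels at time index 10
pixel = movie[5, 7, :]           # one pixel followed through time

# Time-by-Electrode array; the sample count and int16 type are assumptions
n_samples, n_electrodes = 3000, 384
volt = np.zeros((n_samples, n_electrodes), dtype=np.int16)
trace = volt[:, 0]               # the full time series from one electrode
snapshot = volt[1000, :]         # all electrodes at one moment

print(frame.shape, pixel.shape, trace.shape, snapshot.shape)
```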

### Mini-exercise: Data ordering
Does it matter whether you organize your data as "XxYxT" vs "TxYxX"? Can you convert between the different organizations?



In [None]:
# Insert your code here


## A note on data types
We have talked a lot about "container" types, i.e., lists vs arrays vs dataframes. The stuff we put inside those containers was always numbers. From our perspective, that's good enough, but the computer cares a lot about whether those numbers are integers, arbitrary real numbers ("floats"), or complex numbers. You can often ignore these distinctions, but every once in a while you get bitten. For instance:

In [None]:
a = np.arange(0, 24)
b = 7 * 2**a
c = b**3
c

This nonsense results from the fact that numpy stores *a*, *b*, and *c* as fixed-size integers, which, on most computers, may not be larger than 2<sup>63</sup> - 1 or about 10<sup>19</sup>. (Plain Python integers outside of numpy arrays do not have this limit.) If that sounds like a big number, imagine you have a time series of 100,000,000 samples (that's just under an hour of recording at a typical 30 kHz sampling rate), and your algorithm needs the third power of time in its calculation. Here is what happens to some of those numbers:

In [None]:
tt = 1001*1002*13*np.arange(10)
tt3 = tt**3
print(tt)
print(tt3)

Bottom line: If you see unexpected negative numbers or zeros where you expect big positive numbers, worry about data types.
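
To see where the limit sits, the following sketch compares a plain Python integer with the same value stored in a 64-bit numpy array. The names `limit`, `t`, `exact`, and `wrapped` are made up for this example, and the array type is set explicitly to `np.int64` so that the result does not depend on your platform's default.

```python
import numpy as np

limit = np.iinfo(np.int64).max   # largest value an int64 can hold: 2**63 - 1

t = 9 * 1001 * 1002 * 13         # a plain Python int
exact = t**3                     # Python ints grow as needed, so this is exact

# The same calculation in a 64-bit numpy array silently wraps around
wrapped = (np.array([t], dtype=np.int64) ** 3)[0]

print(limit)
print(exact)
print(wrapped)
```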


### Mini-exercise: Contained data types
1. How do you find out whether Python thinks a number is an integer or a float when it is stored inside a (a) list, (b) numpy array, (c) pandas dataframe?
2. How can you convert one contained data type to another?

In [None]:
# Insert your code here


## Point processes

From the computer's perspective, a point process is also just a list of numbers. The computer doesn't necessarily know that these numbers represent time stamps. As such, they can be stored in lists or arrays, just like time series:

In [None]:
a = [12.02, 13.04, 14.01, 14.98, 16.07]
b = np.array(a)

Very often, a point process may have additional data associated with each recorded time point. For instance, we could have a record of earthquakes, with times and magnitudes, or a record of action potentials, with times and neuron of origin:

In [None]:
tms_s = [1.345, 2.42, 3.34, 3.48, 5.43]
celid =   [3,     2,    45,   2,    17  ]

(I always encode the unit of a physical quantity in the name of the variable. And I abbreviate words like "time" and "cell", if for no other reason than that I have a hard time remembering which English words are obscure Python keywords, or the names of important functions.)

Storing such a point process in a data frame is often attractive:

In [None]:
df = pd.DataFrame({"Time_s": tms_s, "CellID": celid})
df
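
One reason this is attractive is that you can select events by their associated data. A minimal sketch, in which `is_cell2`, `cell2`, and `counts` are names invented for this example:

```python
import pandas as pd

df = pd.DataFrame({"Time_s": [1.345, 2.42, 3.34, 3.48, 5.43],
                   "CellID": [3, 2, 45, 2, 17]})

is_cell2 = df["CellID"] == 2           # one True/False value per event
cell2 = df[is_cell2]                   # only the rows that belong to cell 2
counts = df["CellID"].value_counts()   # number of events per cell
print(cell2)
```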

## Multiple point processes

In extracellular physiology, you very commonly encounter simultaneous recordings from *C* different cells, comprising the times at which each cell fired. We already saw that you can store such data in a dataframe, with "time" and "cell ID" as column headings. If you care about individual cells, you could also store this data in a "dict of arrays":

In [None]:
spkt_s = {
    2: np.array([2.42, 3.48]),
    3: np.array([1.345]),
    17: np.array([5.43]),
    45: np.array([3.34])
}

(Of course in practice, such data tend to derive from some algorithm that extracts spike times and putative source identity from raw voltage data, so you would not construct this kind of structure manually in this fashion. We'll see much more of that later today.)
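
If your data start out in the dataframe form, one possible way to get to the dict of arrays is to group the rows by cell ID. This is only a sketch of one approach, not necessarily how the tools used later today do it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Time_s": [1.345, 2.42, 3.34, 3.48, 5.43],
                   "CellID": [3, 2, 45, 2, 17]})

# groupby yields one (cell ID, sub-dataframe) pair per cell
spkt_s = {cel: grp["Time_s"].to_numpy() for cel, grp in df.groupby("CellID")}

print(spkt_s[2])   # the spike times of cell 2
```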