# Exercise 3: Importing and manipulating data

In this exercise, we're going to learn how to play with some tools for importing and manipulating data sets. We'll focus on three key file types:

* Hierarchical Data Format 5 (HDF5): A format for saving large data sets in a "hierarchical" format like folders on your computer. In addition to storing the data itself in an efficient way, this format allows for metadata about the datasets to be embedded in the file itself. [HDF5 website](https://www.hdfgroup.org)
* Comma-Separated Values (CSV): Simply a spreadsheet stored as a text file.
* JavaScript Object Notation (JSON): A text format for saving key-value pairs (akin to a Python dictionary). Like CSV, it can be opened in any text editor. [JSON website](https://www.json.org/)
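
As a minimal sketch of the two text formats, here is a round trip through CSV and JSON using only Python's standard library. The records and field names are made up for illustration.

```python
import csv
import io
import json

# Hypothetical records, invented for this illustration
records = [{"cell": "A", "rate_hz": 4.5}, {"cell": "B", "rate_hz": 12.0}]

# CSV: one header row, then one text row per record
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["cell", "rate_hz"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Reading CSV back gives strings; numeric types have to be restored by hand
rows_from_csv = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: key-value pairs that keep numbers as numbers
json_text = json.dumps(records)
rows_from_json = json.loads(json_text)

print(csv_text)
print(type(rows_from_csv[0]["rate_hz"]), type(rows_from_json[0]["rate_hz"]))
```

The practical difference to notice is that CSV hands everything back as text, while JSON preserves basic types such as numbers and strings.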

Beyond talking about ways to open and access these files directly in Python via the standard library and numpy, we'll also be talking about two really powerful tools for working with data in Python:

* Pandas: a fast, powerful, flexible and easy to use open source data analysis and manipulation tool (reproduces much of the functionality of R). [Pandas website](https://pandas.pydata.org)
* Xarray: an open source project for working with labelled multi-dimensional arrays. [Xarray website](http://xarray.pydata.org/en/stable/)

One great place to get data is:

* [Collaborative Research in Computational Neuroscience - Data repository](https://crcns.org): A great resource for published data sets. You need to request access, but it is open.

It is good to practice with a complete data set _and_ to see how other people store, organize, and archive their data.

## Really important stuff we probably won't have time for

* [Neurodata Without Borders (NWB)](https://www.nwb.org): New open standard for storing all kinds of neuroscience data that should be interoperable between tools. 
* The [Allen Brain Observatory software development kit (allensdk)](https://allensdk.readthedocs.io/en/latest/install.html) has a lot of really powerful tools as well as access to their data.

In [None]:
import numpy as np

## 3.1 labeled arrays using `xarray` <a name="xarray">

Very often arrays have coordinates associated with each of their dimensions (like time or space), but numpy doesn't have a way to store those coordinates with the arrays. This gets cumbersome, and basic operations (like interpolating between two coordinates) become more difficult than you would want. `xarray` is a package designed to address these deficiencies in `numpy`. It has two main objects: `DataArray` and `Dataset`.

[The docs for `xarray` can be found here](http://xarray.pydata.org/en/stable/index.html)

In [None]:
import xarray as xr

### 3.1.1 Constructing a `DataArray`

In [None]:
# Defining some coordinates 
time = np.linspace(0, 20, 401) # in seconds
space = np.linspace(-10, 10, 1001) # in centimeters


In [None]:
# Defining some function of space and time
def height_of_wave(t, x, wavelength=2, speed=.1):
    return(np.sin(2*np.pi*(x - speed*t) / wavelength))

In [None]:
# creating arrays to input into function
tv, sv = np.meshgrid(time, space)

# calculating values of function at each value of `space` and `time`
height = height_of_wave(tv, sv)
height

In [None]:
height.shape

In [None]:
# Cheating a bit to make sure this works...
import matplotlib.pyplot as plt
from matplotlib import cm

plt.imshow(height, cmap="jet")
plt.colorbar()

OK, so now we have three objects that all go together:
1. The actual data: `height`
2. An array of coordinates for the horizontal axis of the image (axis 1 of `height`): `time`
3. An array of coordinates for the vertical axis of the image (axis 0 of `height`): `space`

**and** an implicit set of units, `s` and `cm`.

_Wouldn't it be nice if there were a way to store these in one object? There is! `xr.DataArray`_

The syntax for `DataArray` is

`xr.DataArray(DATA, dims=TUPLE_OF_DIM_NAMES, coords=DICT_OF_COORDINATE_NAMES_AND_VALUES)`

Note that

1. The order of the `dims` matters. The first named dim corresponds to axis 0 in your array, the second corresponds to axis 1, etc.
2. The `coords` keyword takes a dictionary whose keys are the dim names and whose values are the coordinate values along each dim.

In [None]:
# Did I do this right?

xr_height = xr.DataArray(height, dims=("x", "t"), coords={"x": space, "t": time})

In [None]:
# Look how pretty the readout is when I ask jupyter about `xr_height`

xr_height

In [None]:
# What are `attributes` all about? A dictionary where you can store whatever additional information you want.

xr_height.attrs["year"] = 2020

In [None]:
xr_height

`xarray` treats two attribute names specially (for example, when labelling plots):

* `long_name`: a descriptive name for the DataArray
* `units`: self-explanatory, but note that the DataArray _and_ each of its coordinates have separate attributes, so you can set units for each of them.

In [None]:
# Setting some key attributes

xr_height.attrs["long_name"] = "Height of a travelling wave"
xr_height.attrs["units"] = "mm"
xr_height.x.attrs["units"] = "cm"
xr_height.t.attrs["units"] = "s"

In [None]:
# Check to see you can find the metadata...
xr_height

### 3.1.2 Accessing data array elements

Having all the coordinate info included with the object makes accessing relevant info much easier. For example, if we only had the original trio of numpy arrays and wanted to find the value of `height` when `space` = 4 and `time` = 2, we would need to use all three objects:

In [None]:
# Finding the value of `height` when `space` = 4 cm and `time` = 2 s

height[space == 4, time == 2]

In [None]:
height[(space>0)&(space<4), time == 2]

If you had multiple arrays floating around with their own `space` and `time` this would get challenging fast.
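
One caveat with the numpy approach is that `==` on floating-point coordinates only matches when the grid value is represented exactly. A more forgiving sketch, assuming the same grids and wave function as above, uses `np.isclose` to find the indices within a small tolerance:

```python
import numpy as np

# Same coordinate grids and function as defined earlier in this exercise
time = np.linspace(0, 20, 401)      # in seconds
space = np.linspace(-10, 10, 1001)  # in centimeters

def height_of_wave(t, x, wavelength=2, speed=.1):
    return np.sin(2*np.pi*(x - speed*t) / wavelength)

tv, sv = np.meshgrid(time, space)
height = height_of_wave(tv, sv)

# np.isclose matches within a small tolerance instead of requiring exact equality
ix = np.flatnonzero(np.isclose(space, 4))
it = np.flatnonzero(np.isclose(time, 2))

print(ix, it)
print(height[ix[0], it[0]])
```

This still needs all three arrays, which is the bookkeeping that `DataArray` removes.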

To access `DataArray` information you can use multiple methods:

1. index the array directly, by position and integer index (just like in numpy)
2. `.isel`, by dimension name and integer index
3. `.loc`, by position and coordinate label
4. `.sel`, by dimension name and coordinate label

In [None]:
# Going back to the example before when `space` = 4 and `time` = 2.
# What location is this in the array?

print(np.where(space == 4))
print(np.where(time == 2))

In [None]:
# Ok it is at the index integers 700, 40. So we could pick out that place and time from  `xr_height`
# Should see the same value as above pop out. 

xr_height[700, 40]

In [None]:
# Now `.isel` also uses these integer indexes. Note: `.isel` is a method of
# a `DataArray`. It takes keyword arguments that are the dim names.
# Note that I can put these in any order I want and the reader can explicitly 
# see what 40 and 700 refer to. 

xr_height.isel(t=40, x=700)

In [None]:
# Now the really cool part is skipping the whole step of finding the index and 
# just using the coordinate values themselves. Analogous to the above, we have
# two ways of doing this: without keywords and with. Without keywords is the
# `.loc` indexer. Like traditional indexing it uses subscripting (i.e. [ ])

xr_height.loc[4, 2]

In [None]:
# Similarly, `.sel` is a method that allows you to specify the dims explicitly as keywords

xr_height.sel(t=2, x=4)

In [None]:
# You can even slice **using the coordinate values**. This slice takes all
# values of x from 0 to 4 and all values of t from 0 to 2

xr_height.loc[0:4, 0:2]
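
One detail worth knowing about coordinate slices is that they include both end points, unlike integer slices. A small sketch with a made-up one-dimensional array (not the wave data) shows the difference:

```python
import numpy as np
import xarray as xr

# Small made-up array with one labelled dimension
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
da = xr.DataArray(x**2, dims=("x",), coords={"x": x})

by_label = da.sel(x=slice(0, 4))      # coordinate values 0 to 4, end point included
by_position = da.isel(x=slice(0, 4))  # positions 0 to 3, end point excluded

print(by_label.x.values)
print(by_position.x.values)
```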

### 3.1.3 Interpolating 

In [None]:
# What happens when you ask one of these magic coordinate methods 
# to pull a value that's not actually in the set of coordinates?

xr_height.loc[4, np.pi]

In [None]:
# You can use `.sel` with `method="nearest"` to pick out the nearest value even if it's not in the coords

xr_height.sel(x=4.17263, t=np.pi, method="nearest")

In [None]:
# The exact `.loc` lookup above raised a `KeyError`. That's fair: after all, our 
# coordinates don't contain that key. There's another function which will use 
# surrounding data and interpolate from that data to the point we want: `.interp`. 
# It follows the same conventions as `.sel` in that you have to specify the names 
# of the dimensions.

xr_height.interp(x=4, t=np.pi)

In [None]:
# Checking to see if this is reasonable...

xr_height.loc[4, 3.1:3.2]
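
To get a feel for what the interpolation does, here is a numpy-only sketch of linear interpolation along `t` at `x = 4`, compared with the exact value from the formula. It assumes `.interp` is using its default linear method, and it illustrates the idea rather than `xarray`'s internal implementation.

```python
import numpy as np

time = np.linspace(0, 20, 401)  # same time grid as above, in seconds

def height_of_wave(t, x, wavelength=2, speed=.1):
    return np.sin(2*np.pi*(x - speed*t) / wavelength)

samples = height_of_wave(time, 4.0)  # heights on the time grid at x = 4 cm

# Linear interpolation between the two grid points that bracket t = pi
estimate = np.interp(np.pi, time, samples)

# The exact value, available here only because we know the formula
exact = height_of_wave(np.pi, 4.0)

print(estimate, exact)
```

With a grid spacing of 0.05 s and such a slowly varying wave, the two numbers should agree to several decimal places.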

### 3.1.4 Math with `DataArray` objects

`DataArray` objects have many of the same standard functions that `np.array` objects do, but knowledge of the coordinates makes life easier. We'll look at a few examples:

1. Single-array operations (like `.mean()`) along a dimension no longer require you to remember which variable was axis 0. You can just use the dimension name to specify which dimension you want to average over.
2. Two-array operations (like `*`) automatically match the operands up by dimension name, whether or not the axes line up!

In [None]:
# Taking the overall mean works as before...

xr_height.mean()

In [None]:
# Averaging over the x direction only:

xr_height.mean(dim='x')
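
For comparison, here is a small sketch (with made-up numbers rather than the wave data) of the same reduction done by dimension name and by numpy axis number:

```python
import numpy as np
import xarray as xr

values = np.arange(6.0).reshape(3, 2)  # made-up data: 3 x-points, 2 t-points
da = xr.DataArray(values, dims=("x", "t"),
                  coords={"x": [0.0, 1.0, 2.0], "t": [0.0, 0.5]})

mean_by_name = da.mean(dim="x")     # average over x, keeps t
mean_by_axis = values.mean(axis=0)  # numpy equivalent: you must remember x is axis 0

print(mean_by_name.dims)
print(mean_by_name.values, mean_by_axis)
```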

To think about the math we're going to create a new DataArray that only has a spatial dimension but no time. 

In [None]:
# creating a linear offset

height_offset = xr.DataArray(.05*space, dims=("x",), coords={"x": space})

In [None]:
height_offset

In [None]:
xr_height

In [None]:
# What should adding these two do? Adding the raw numpy arrays would crash and burn 
# because their shapes, (1001, 401) and (1001,), don't broadcast together

new_height = xr_height + height_offset
new_height

In [None]:
# Huh? This worked? What did it do? 

# Here's the original
plt.imshow(xr_height, cmap='jet')
plt.colorbar()

In [None]:
# Here's the new one.
plt.imshow(new_height, cmap='jet')
plt.colorbar()

**What did `+` do?**

It matched the offset to the `x` dimension by name and added it at every `t`! This makes sense!
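
To make this concrete, here is a sketch with small made-up arrays showing what you would have to write in plain numpy to get the same result:

```python
import numpy as np
import xarray as xr

x = np.array([0.0, 1.0, 2.0])
t = np.array([0.0, 0.5])
field = xr.DataArray(np.ones((3, 2)), dims=("x", "t"), coords={"x": x, "t": t})
offset = xr.DataArray(10 * x, dims=("x",), coords={"x": x})

# xarray matches the two arrays on the dimension name "x"
by_name = field + offset

# numpy aligns trailing axes, so the offset needs an explicit new axis
by_hand = field.values + offset.values[:, np.newaxis]

print(by_name.dims)
print(by_name.values)
```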

There is a lot more here, but the key is that math can be simpler. For example, `xarray` uses the dimension names to do proper matrix multiplication:

In [None]:
# Note that `x` is the shared dimension so it should disappear after matrix
# multiplication. 

xr_height @ height_offset
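
As a quick check of what `@` is doing, this sketch (again with made-up numbers) compares it with the numpy product, where you have to choose the operand order yourself:

```python
import numpy as np
import xarray as xr

x = np.array([0.0, 1.0, 2.0])
t = np.array([0.0, 0.5])
field = xr.DataArray(np.arange(6.0).reshape(3, 2), dims=("x", "t"),
                     coords={"x": x, "t": t})
weights = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims=("x",), coords={"x": x})

product = field @ weights                # sums over the shared dimension "x"
by_hand = weights.values @ field.values  # numpy: (3,) @ (3, 2) gives shape (2,)

print(product.dims)
print(product.values, by_hand)
```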

### 3.1.5 Loading / saving `xarray` objects: NetCDF, JSON, Pickle

`xarray` natively saves to the [NetCDF (Network Common Data Form)](https://www.unidata.ucar.edu/software/netcdf/docs/faq.html#What-Is-netCDF) format. This format was originally created for climate science. Under the hood, NetCDF4 uses a more general format that we will talk more about later: [Hierarchical Data Format (HDF)](https://portal.hdfgroup.org/display/HDF5/HDF5).

You can also convert a `DataArray` to a dictionary and save it using Python's `pickle` package or as a `JSON` file. JSON (JavaScript Object Notation) is a text format for saving key-value pairs (akin to a Python dictionary). [JSON website](https://www.json.org/)

Finally, you can convert it to `pandas` objects (something we'll talk about a bit later); pandas knows how to save to **many** other formats.

In [None]:
# Option 1, Saving the array as a NetCDF file...
# `os.path.join` builds the path "data/height.nc" portably; the `data` folder must already exist

import os

xr_height.to_netcdf(os.path.join("data", "height.nc"), format='NETCDF4', engine='netcdf4')

In [None]:
# ASIDE: It's secretly an HDF5 file
# The 'r' here stands for 'read'. Could have just as easily been 'w' for write or 'a' for append. 

import h5py

with h5py.File(os.path.join("data", "height.nc"), 'r') as h5_height:
    for name in h5_height:
        print(name)

In [None]:
# ...and checking that this worked.

xr.open_dataarray(os.path.join("data", "height.nc"))

In [None]:
# Option 2, converting to a dictionary...

dict_height = xr_height.to_dict()

In [None]:
dict_height.keys()

In [None]:
# ...saving dictionary as a pickle...
import pickle

with open(os.path.join("data", "height.pickle"), 'wb') as f:
    pickle.dump(dict_height, f)

In [None]:
# ...and checking whether it worked. 

with open(os.path.join("data", "height.pickle"), 'rb') as f:
    dict_height_2 = pickle.load(f)

In [None]:
dict_height_2.keys()

In [None]:
# Option 3, converting to a dictionary and saving as JSON

import json

with open(os.path.join("data", "height.json"), 'w') as f:
    json.dump(dict_height, f)

In [None]:
with open(os.path.join("data", "height.json"), 'r') as f:
    dict_height_2 = json.load(f)

In [None]:
dict_height_2.keys()
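
The dictionary that comes back can be turned into a `DataArray` again with `xr.DataArray.from_dict`. Here is a sketch of the full round trip on a small made-up array, going through a JSON string in memory instead of a file:

```python
import json

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6.0).reshape(3, 2), dims=("x", "t"),
                  coords={"x": [0.0, 1.0, 2.0], "t": [0.0, 0.5]},
                  attrs={"units": "mm"})

text = json.dumps(da.to_dict())                     # DataArray -> dict -> JSON text
rebuilt = xr.DataArray.from_dict(json.loads(text))  # JSON text -> dict -> DataArray

print(rebuilt)
```

Keep in mind that JSON stores every number as text, so this is convenient for small arrays but gets bulky for large ones.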

## 3.2 HDF5 using `h5py`

Hierarchical Data Format (HDF5) is a format for storing data in groups with attached metadata. [The docs for h5py can be found here.](https://docs.h5py.org/en/stable/)

In [None]:
import h5py

### 3.2.1 Loading HDF5 files

In [None]:
# importing the HDF5 file into python as `data`.
# The 'r' here stands for 'read'. Could have just as easily been 'w' for write or 'a' for append. 

data = h5py.File(os.path.join("data", "dataset_2017_08_25_postrun", "2017-08-25_09-50-43.hdf5"), 'r')

In [None]:
# To see all the data use the .visit function 

data.visit(print)

### 3.2.2 Navigating HDF5 files <a name="HDF">

In [None]:
spikes = data["ephys/TT1/spikes/times"]

In [None]:
# What's going on here?

spikes

In [None]:
# Need to cast to array

spikes_arr = np.array(spikes)

In [None]:
spikes_arr

In [None]:
import matplotlib.pyplot as plt

plt.scatter(spikes_arr[:100], 100*[1], c='k', marker='|')

### 3.2.3 Writing to HDF5 files

In [None]:
# `a` = append

h5file = h5py.File(os.path.join("data", "my_data.h5"), mode="a")

In [None]:
# Creating a group

group = h5file.create_group("/cell_1")

In [None]:
# Making some fake data 

time = np.linspace(0, 100, 1000)
voltage = np.random.rand(1000)

In [None]:
# saving these as arrays in the hdf5 file

group.create_dataset("voltage", data=voltage)
group["voltage"].attrs["units"] = "mV"
group.create_dataset("time", data=time)
group["time"].attrs["units"] = "s"


In [None]:
h5file.visit(print)

In [None]:
# Don't forget to close a file when you are done!

h5file.close()
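
An alternative that avoids forgetting to close the file is to open it in a `with` block. The sketch below writes to a temporary folder so it does not touch the `data` directory. The file name and the random voltage values are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

voltage = np.random.rand(1000)  # made-up data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")  # hypothetical file name

    # The file is closed automatically at the end of each `with` block
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("cell_1/voltage", data=voltage)
        dset.attrs["units"] = "mV"

    with h5py.File(path, "r") as f:
        read_back = f["cell_1/voltage"][:]  # slicing reads the data into a numpy array
        units = f["cell_1/voltage"].attrs["units"]

print(read_back.shape, units)
```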

## 3.2(alt) HDF5 using `tables` <a name="HDF5alt">

The `pytables` package is a robust way to interact with HDF5 files. [You can find the package website here](https://www.pytables.org).

In [None]:
import tables

### 3.2.1 Loading HDF5 files

In [None]:
# importing the HDF5 file into python as `data`.
# The 'r' here stands for 'read'. Could have just as easily been 'w' for write or 'a' for append. 

data = tables.File("data/dataset_2017_08_25_postrun/2017-08-25_09-50-43.hdf5", 'r')

In [None]:
# To have tables spit out the whole kit and caboodle  

data

In [None]:
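# PyTables has no `.visit`; `walk_nodes` iterates over the nodes under a group
for node in data.walk_nodes("/ephys/TT24/spikes"):
    print(node)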

### 3.2.2 Navigating HDF5 files

In [None]:
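# In Jupyter, type `data.root.` and press Tab to browse the child nodes.
# Evaluating `data.root` on its own lists them:
data.root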

In [None]:
spikes = data.root.ephys.TT1.spikes.times

In [None]:
plt.scatter(spikes[:100], 100*[1], c='k', marker='|')

### 3.2.3 Writing to HDF5 files

In [None]:
h5file = tables.open_file("data/my_data.h5", mode="a", title="My test file")

In [None]:
# Creating a group

group = h5file.create_group("/", 'cell_1', 'Data from cell_1')

In [None]:
# Making some fake data 

time = np.linspace(0, 100, 1000)
voltage = np.random.rand(1000)

In [None]:
# saving these as arrays in the hdf5 file

h5file.create_array(group, 'time', time, "time (ms)")
h5file.create_array(group, 'voltage', voltage, "voltage (mV)")

In [None]:
h5file

In [None]:
# Can also create tables with columns of different types
# Defining the information you'd like in your table, and creating an empty table

class my_ephys(tables.IsDescription):
    time  = tables.Float32Col()      # 32-bit float
    voltage = tables.Float32Col()    # 32-bit float
    
table = h5file.create_table(group, 'my_data', my_ephys, "recording session")

In [None]:
moment = table.row
for i, j in zip(time, voltage):
    moment['time']  = i
    moment['voltage'] = j
    moment.append()

table.flush()

In [None]:
h5file

In [None]:
# Saving metadata
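
One possible way to attach metadata with PyTables is through node attributes: `.attrs` on arrays and tables, and `._v_attrs` on groups. The sketch below is self-contained and writes to a temporary file. The file name, the attribute names (`units`, `animal`) and their values are made up for illustration.

```python
import os
import tempfile

import numpy as np
import tables

voltage = np.random.rand(1000)  # made-up data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")  # hypothetical file name

    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "cell_1")
        array = h5.create_array(group, "voltage", voltage)
        array.attrs.units = "mV"          # attribute on a leaf (array or table)
        group._v_attrs.animal = "rat_01"  # attribute on a group (made-up value)

    with tables.open_file(path, mode="r") as h5:
        units = h5.get_node("/cell_1/voltage").attrs.units
        animal = h5.root.cell_1._v_attrs.animal

print(units, animal)
```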





## 3.3 DataFrames from `pandas` <a name="pandas">

We're going to be working through [the `pandas` tutorials here.](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html)

In [None]:
import pandas as pd

### 3.3.1 Creating a `DataFrame` object 

In [None]:
# Option 1: Feed `DataFrame` a dictionary where keys are columns and values are a list of
# values

dict_data = {"Name": ["Braund, Mr. Owen Harris",  "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
        "Age": [22, 35, 58], "Sex": ["male", "male", "female"]}

df = pd.DataFrame(dict_data)

In [None]:
df

In [None]:
# Option 2, a list of dictionary objects, note what it does for the missing 
# "sex" for "Owen"

option_2 = [{"Name": "Owen", "Age": 22}, {"Name": "William", "Age": 35, "Sex":'male'}]
pd.DataFrame(option_2)
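
The missing entry shows up as `NaN`. A short sketch of how you might detect or fill it (the `"unknown"` placeholder is an arbitrary choice):

```python
import pandas as pd

rows = [{"Name": "Owen", "Age": 22},
        {"Name": "William", "Age": 35, "Sex": "male"}]
df = pd.DataFrame(rows)

missing = df["Sex"].isna()            # True where no value was supplied
filled = df["Sex"].fillna("unknown")  # replace missing entries with a placeholder

print(missing.tolist())
print(filled.tolist())
```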

In [None]:
# Option 3, just go directly to a dataframe from another object type...

xr_height.to_pandas()
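
For a two-dimensional `DataArray`, `to_pandas()` gives a wide table with one row per value of the first dimension and one column per value of the second. If you would rather have one row per (x, t) pair, `to_dataframe()` is an alternative, though it needs the array to have a name. A sketch on a small made-up array:

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.arange(6.0).reshape(3, 2), dims=("x", "t"),
                  coords={"x": [0.0, 1.0, 2.0], "t": [0.0, 0.5]},
                  name="height")

wide = da.to_pandas()        # 2-D DataArray -> DataFrame: rows are x, columns are t
long = da.to_dataframe()     # one row per (x, t) pair, indexed by a MultiIndex

print(wide)
print(long)
```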

### 3.3.2 Working through the [pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html)