In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib notebook

# This Laptop Is Inadequate:
# An Aperitif for DSFP Session 8

**Version 0.1**

By AA Miller 2019 Mar 24

When I think about LSST there are a few numbers that always stick in my head: 

  -  37 billion (the total number of sources that will be detected by LSST)
  -  10 (the number of years for the baseline survey)
  -  1000 (the approximate number of observations per source)
  -  37 trillion ($37 \times 10^9 \times 10^3$ = the total number of source observations)

These numbers are *eye-popping*, though the truth is that there are now several astronomical databases that have $\sim{10^9}$ sources (e.g., PanSTARRS-1, which we will hear more about later today). 

A pressing question, for current and future surveys, is: how are we going to deal with all that data?

If you're anything like me, then you love your laptop. 

And if you had it your way, you wouldn't need anything but your laptop for anything that you have ever worked on.

But is that practical?

## Problem 1) The Inadequacy of Laptops

**Problem 1a**

Suppose you could describe every source detected by LSST with a single number. Assuming you are on a computer with a 64 bit architecture, to within an order of magnitude, how much RAM would you need to store every LSST source within your laptop's memory?

*Bonus question* - can you think of a single number to describe every source in LSST that could produce a meaningful science result?

*Take a minute to discuss with your partner*

$$\frac{64 \, \mathrm{bit}}{1 \, \mathrm{source}} \times \frac{1 \,\mathrm{GB}}{8\times10^9 \,\mathrm{bit}} \times 3.7 \times 10^{10}\, \mathrm{sources} \approx 296 \, \mathrm{GB}$$

While there are specialized machines that have this amount of memory, this is unreasonable for any modern laptop. 
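The estimate above can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it assumes decimal gigabytes ($10^9$ bytes), as in the equation.

```python
# Back-of-the-envelope memory estimate: one 64-bit number per LSST source.
n_sources = 3.7e10       # 37 billion sources (from the numbers quoted above)
bits_per_source = 64     # a single 64-bit value per source
bits_per_gb = 8e9        # 8 bits per byte, 1e9 bytes per (decimal) GB

ram_gb = n_sources * bits_per_source / bits_per_gb
print(f"{ram_gb:.0f} GB")  # prints "296 GB"
```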

As for a single number to perform useful science, I can think of two. 

First - you could generate a [hierarchical triangular mesh](http://www.skyserver.org/HTM/) with enough trixels to characterize every LSST resolution element on the night sky. Then you could assign a number to each trixel, and describe the position of every source in LSST with a single number. Under the assumption that every source detected by LSST is a galaxy (which is not a terrible assumption), you could look at the clustering of these positions to (potentially) learn things about structure formation or galaxy formation (though without redshifts you may not learn all that much).

The other number is the flux (or magnitude) of every source in a single filter. Again, under the assumption that everything is a galaxy, the number counts (i.e. a histogram) of the flux measurements tells you a bit about the Universe. 
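As a rough sketch of what "number counts" means in practice, the snippet below bins a set of magnitudes with `np.histogram`. The magnitudes here are synthetic (drawn uniformly at random purely for illustration, not real survey data), and the magnitude range and bin width are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for single-filter magnitudes; NOT real survey data.
mags = rng.uniform(18.0, 25.0, size=100_000)

# Number counts: how many sources fall in each 0.5 mag wide bin.
bin_edges = np.arange(18.0, 25.5, 0.5)
counts, edges = np.histogram(mags, bins=bin_edges)

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:4.1f} - {hi:4.1f} mag: {n} sources")
```

With real data, the shape of these counts as a function of magnitude is the quantity of interest.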

It probably isn't a shock that you won't be able to analyze every individual LSST source on your laptop.

But that raises the question - how should you analyze LSST data?

  -  By buying a large desktop?
  -  On a local or national supercomputer?
  -  In the cloud?
  -  On computers that LSST hosts/maintains?

Each of these options has drawbacks:

  -  By buying a large desktop? (impractical to ask of everyone working on LSST)
  -  On a local supercomputer? (not a bad idea, but not necessarily equitable)
  -  In the cloud? (AWS is expensive)
  -  On computers that LSST hosts/maintains? (probably the most fair, but this also has challenges)

We will discuss some of these issues a bit later in the week...

## Problem 2) Laptop or Not You Should Be Worried About the Data

### Pop quiz

We will now re-visit a question from a previous session:

**Problem 2a**

What is data?

*Take a minute to discuss with your partner*

**Solution 2a**

Data are constants.

This leads to another question: 

Q - What is the defining property of a constant?

A - They don't change.

If data are constants, and constants don't change, then we should probably be sure that our data storage solutions do not alter the data in any way. 

Within the data science community, the Python [`pandas`](https://pandas.pydata.org/) package is particularly popular for reading, writing, and manipulating data (we will talk more about the utility of `pandas` later). 

The `pandas` docs state the `read_csv()` method is the [workhorse function for reading text files](http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#csv-text-files). Let's now take a look at how well this workhorse "maintains the constant nature of data". 

**Problem 2b**

Create a `numpy` array, called `nums`, of length 10000 filled with random numbers. Create a `pandas` `Series` object, called `s`, based on that array, and then write the `Series` to a file called `tmp.txt` using the `to_csv()` method.

*Hint* - you'll need to name the `Series` and add the `header=True` option to the `to_csv()` call.

In [128]:
nums = np.random.rand(10000)
s = pd.Series(nums, name='nums')
s.to_csv('tmp.txt', header=True, index=False)

**Problem 2c**

Using the `pandas` `read_csv()` method, read in the data to a new variable, called `s_read`. Do you expect `s_read` and `nums` to be the same? Check whether or not your expectations are correct. 

*Hint* - sum the boolean array that marks where the difference is not equal to zero to count how many elements are not the same.

In [129]:
s_read = pd.read_csv('tmp.txt')

sum(nums - s_read['nums'].values != 0)

2302

So, it turns out that $\sim{23}\%$ of the time, `pandas` does not in fact read in the same number that it wrote to disk.

The truth is that these differences are quite small (see next slide), but there are many mathematical operations (e.g., subtraction of very similar numbers) that may lead these tiny differences to compound over time such that your data are not, in fact, constant.

In [130]:
print(np.max(np.abs(nums - s_read['nums'].values)))

2.220446049250313e-16
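To see how subtracting very similar numbers can amplify a difference this small, here is a minimal sketch. The values `x_true`, `x_read`, and `reference` are hypothetical and chosen only for illustration: `x_read` differs from `x_true` by the same amount as the maximum difference printed above.

```python
import numpy as np

eps = np.finfo(np.float64).eps   # 2.220446049250313e-16
x_true = 1.0                     # the value that was written (hypothetical)
x_read = x_true + eps            # the value read back, off in the last bit
reference = 1.0 - 1e-10          # some other number very close to x_true

# Relative error of the stored value itself: about 2e-16.
rel_before = abs(x_read - x_true) / abs(x_true)

# Relative error after subtracting the nearby reference value.
diff_true = x_true - reference
diff_read = x_read - reference
rel_after = abs(diff_read - diff_true) / abs(diff_true)

print(rel_before, rel_after)
```

The absolute error is unchanged by the subtraction, but the result is so much smaller than the inputs that the relative error grows by many orders of magnitude.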


So, what is going on?

Sometimes, when you convert a number to ASCII (i.e. text) format, there is some precision that is lost in that conversion. 
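One general way this can happen is when too few decimal digits are kept in the text representation. The sketch below uses plain Python string formatting (it is an illustration of the mechanism, not a description of what `pandas` does internally): 15 significant digits are not always enough to recover a 64-bit float, while 17 are.

```python
x = 0.1 + 0.2                 # stored as 0.30000000000000004

text_15 = f"{x:.15g}"         # '0.3' (15 significant digits)
text_17 = f"{x:.17g}"         # '0.30000000000000004' (17 significant digits)

print(float(text_15) == x)    # False: precision was lost in the text
print(float(text_17) == x)    # True: 17 digits round-trip a float64
```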

How do you avoid this?

One way is to directly write your files in binary. Doing so has several advantages: it is possible to reproduce byte level accuracy, and binary storage is almost always more efficient than text storage (the same number can be written in binary with less space than in ASCII). 
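Both claims can be checked with a small in-memory sketch. It assumes raw `float64` bytes as the "binary format" and one `repr` string per line as the "text format"; real file formats add some overhead on top of this.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(10_000)               # 10000 float64 values

# Binary: exactly 8 bytes per float64.
binary_bytes = values.tobytes()

# Text: one number per line, using Python's shortest round-trip repr.
text_bytes = "\n".join(repr(v) for v in values.tolist()).encode("ascii")

print(len(binary_bytes), "bytes in binary")
print(len(text_bytes), "bytes as text")

# Reading the raw bytes back reproduces the array exactly.
restored = np.frombuffer(binary_bytes, dtype=np.float64)
print(np.array_equal(restored, values))   # True
```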

The downside is that developing your own procedure to write data in binary is a pain, and it places strong constraints on where and how you can interact with the data once it has been written to disk. 

Fortunately, we live in a world with `pandas`. All this hard work has been done for you, as `pandas` naturally interfaces with the `hdf5` binary table format. (You may also want to take a look at `PyTables`.)

**Problem 2d**

Repeat your procedure from above, but instead of writing to a csv file, use the `pandas` `to_hdf()` and `read_hdf()` methods to see if there are any differences between `s` and `s_read`.  

*Hint* - You will need to specify a name for the table that you have written to the `hdf5` file in the call to `to_hdf()` as a required argument. Any string will do.

*Hint 2* - Use `s_read.values` instead of `s_read['nums'].values`.

In [131]:
# Write the Series to HDF5 under the key 's' (requires the PyTables package), then read it back.
s.to_hdf('tmp.h5', key='s')
s_read = pd.read_hdf('tmp.h5', key='s')

sum(nums - s_read.values != 0)

0

So, if you are using `pandas` anyway (and if you aren't, I strongly suggest you make it part of your normal workflow), then I strongly suggest replacing csv files in your workflow with binary hdf5 files. This requires typing the same number of characters, but it ensures byte level reproducibility. 

And reproducibility is the pillar upon which the scientific method is built. 

Is that the end of the story? ... No.

In the previous example, I was being a little tricky in order to make a point. It *is* in fact possible to create reproducible csv files with `pandas`. By default, `pandas` sacrifices a little bit of precision in order to gain a lot more speed. If you want to ensure reproducibility then you can specify that the `float_precision` should be `round_trip`, meaning you get the same thing back after reading from a file that you wrote to disk. 

In [132]:
s.to_csv('tmp.txt', header=True, index=False)

s_read = pd.read_csv('tmp.txt', float_precision='round_trip')

sum(nums - s_read['nums'].values != 0)

0

So was all of this in service of a lie?

No. What I said before remains true - text files do not guarantee byte level precision, and they take more space on disk. But there are some advantages to using text files: 

  -  anyone, anywhere, on essentially any platform can easily inspect and deal with text files
  -  text files can be easily inspected (and corrected) if necessary
  -  special packages are needed to read/write in binary
  -  binary files, which are not easily interpretable, are difficult to use in version control (and banned by some version control platforms)

To summarize, here is my advice: think of binary as your (new?) default for storing data.

But, as with all things, consider your audience: if you are sharing/working with people who won't be able to deal with binary data, or you have an incredibly small amount of data, csv (or other text files) should be fine.

## Problem 3) But what really matters is organization

In [None]:


# a = '0.3066101993807095471566981359501369297504425048828125'
a = np.pi*1e19
print(a)
with open('tmp.txt','w') as fw:
    print('{:.16f}'.format(a), file=fw)
with open('tmp.txt') as f:
    b = f.readline()

print(float(b) - a)

float(b)
    
    
# for i in range(10000):
#     with open('tmp.txt') as f:
#         ha = f.readline()
#     with open('tmp.txt','w') as fw:
#         ha = float(ha)
#         print(ha, file=fw)

In [66]:
ha = np.random.rand(100000)
# with open('tmp.txt','w') as fw:
#     for h in ha:
#         print(h, file=fw)
np.savetxt('tmp.txt', ha)
b = np.loadtxt('tmp.txt')

In [67]:
sum(ha - b)

0.0

In [36]:
with open('tmp.txt','w') as fw:
    print(np.pi)
    print(14, file=fw)

for i in range(100):
    with open('tmp.txt') as f:
        ha = f.readline()
    with open('tmp.txt','w') as fw:
        ha = float(ha)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
        print(ha, file=fw)

3.141592653589793


In [37]:
ha

14.0

In [33]:
ha[0]

14.00000001765