# NumPy arrays and C++ binary IO

cf. Peter Gottschling.  **Discovering Modern C++: An Intensive Course for Scientists, Engineers, and Programmers**, A.2.7 Binary I/O.  

cf. [Writing binary in c++ and read in python](https://stackoverflow.com/questions/37503346/writing-binary-in-c-and-read-in-python)

In [1]:
import numpy
import numpy as np

In [2]:
# find out where we are in the file directory
import os, sys

In [3]:
print(os.getcwd())
datafilefolder = "./data/"

/home/topolo/PropD/CompPhys/Cpp/Cpp14/FileIO


### [`numpy.ndarray.tofile`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.tofile.html)  

`ndarray.tofile(fid,sep="",format="%s")`  

Write array to a file as text or binary (default).  

Data always written in 'C' order, independent of order of *a*.  

#### Parameters  
**fid** : *file or str*  
        An open file object, or string containing filename.  
**sep** : *str*
        Separator between array items for text output.  If "" (empty), a binary file is written, equivalent to `file.write(a.tobytes())`  
**format** : *str*
        Format string for text file output.  Each entry in the array is formatted to text by first converting it to closest Python type, and then using "format" % item.  
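As a quick check of the `sep` behaviour described above, the sketch below writes the same small array once as raw binary and once as text. The temporary directory and the file names `raw.bin` and `text.txt` are made up for illustration.

```python
import os
import tempfile

import numpy as np

a = np.arange(6, dtype=np.float32).reshape((2, 3))

with tempfile.TemporaryDirectory() as tmpdir:
    raw_path = os.path.join(tmpdir, "raw.bin")    # hypothetical file name
    text_path = os.path.join(tmpdir, "text.txt")  # hypothetical file name

    # sep="" (the default) writes the raw bytes of the array and nothing else
    a.tofile(raw_path)
    with open(raw_path, "rb") as f:
        raw_bytes = f.read()

    # a non-empty sep switches to text output, one formatted item at a time
    a.tofile(text_path, sep=",", format="%.1f")
    with open(text_path) as f:
        text = f.read()

print(len(raw_bytes))            # 6 items * 4 bytes each
print(raw_bytes == a.tobytes())  # binary output matches tobytes()
print(text)
```

Note that neither file records the shape `(2, 3)` or the dtype; both have to be known by whoever reads the file back.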

In [4]:
m=5
n=4

In [5]:
A = 11.111111*np.array(range(m*n),dtype=np.float32).reshape((m,n))

In [6]:
print(A)

[[   0.           11.11111069   22.22222137   33.33333206]
 [  44.44444275   55.55555344   66.66666412   77.777771  ]
 [  88.8888855   100.          111.11110687  122.22221375]
 [ 133.33332825  144.44444275  155.55554199  166.66665649]
 [ 177.777771    188.8888855   200.          211.11109924]]


In [7]:
# tofile writes raw bytes with no header, so despite the extension this is not the NPY format
Afilename = "A_mat_5_4.npy"

In [8]:
# Make sure the output folder exists, then write the array as raw bytes
if not os.path.exists(datafilefolder):
    os.makedirs(datafilefolder)
A.tofile(datafilefolder + Afilename)

In [9]:
print(os.listdir(datafilefolder))
print(os.listdir(os.getcwd()))

['A_mat_5_4.npy']
['data', '.ipynb_checkpoints', 'FileIO.ipynb', 'FileIO_old.ipynb', 'binIO_playground.exe', 'binIO_playground.cu']


### [`numpy.fromfile`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html)  
```  
numpy.fromfile(file, dtype=float, count=-1, sep='')  
```  

In [10]:
A_in = np.fromfile(datafilefolder+ Afilename, dtype=np.float32)

In [11]:
print(A_in)

[   0.           11.11111069   22.22222137   33.33333206   44.44444275
   55.55555344   66.66666412   77.777771     88.8888855   100.
  111.11110687  122.22221375  133.33332825  144.44444275  155.55554199
  166.66665649  177.777771    188.8888855   200.          211.11109924]
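The output above is one-dimensional because the raw file stores only the values. A minimal round-trip sketch, using a temporary file with the hypothetical name `A_raw.bin`, shows that the caller has to supply both the dtype and the shape, and that `count` limits how many items are read:

```python
import os
import tempfile

import numpy as np

m, n = 5, 4
A = np.arange(m * n, dtype=np.float32).reshape((m, n))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "A_raw.bin")  # hypothetical file name
    A.tofile(path)

    # The shape is not stored in the file, so fromfile returns a flat array
    flat = np.fromfile(path, dtype=np.float32)
    # count=n reads only the first n items, i.e. the first row of A
    first_row = np.fromfile(path, dtype=np.float32, count=n)

# The caller must know (m, n) to restore the matrix
restored = flat.reshape((m, n))

print(flat.shape)
print(np.array_equal(restored, A))
print(first_row)
```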


Next, go to the CUDA C++14 file `binIO_playground.cu`, or to the serial C++14 version `binIO_playground.cpp`, and load the binary file with [`std::ifstream`](http://www.cplusplus.com/reference/fstream/ifstream/), the input stream class for operating on files.  

The **most important thing to note** is that NumPy's `.reshape` uses **row-major ordering** by default, i.e. the flat index $k$ maps to a (row, column) pair as  

$$  
\lbrace 0,1,\dots, mn-1 \rbrace \to \lbrace 0,1, \dots, m-1 \rbrace \times \lbrace 0,1, \dots, n-1 \rbrace \\
k \mapsto (\lfloor k/n \rfloor, k \bmod n)
$$

and so, when we read this binary file in C++ with `std::ifstream` and `.read(...)`, we assume row-major ordering for the matrix $A$.  
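The row-major index map can be checked directly in Python. This is a small sketch with the same $m = 5$, $n = 4$ as above; `divmod(k, n)` returns the integer quotient and remainder, which are the row and column indices.

```python
import numpy as np

m, n = 5, 4
A = np.arange(m * n).reshape((m, n))  # reshape defaults to row-major ('C') order
flat = A.ravel()                      # flat view in the same row-major order

for k in (0, 5, 19):
    i, j = divmod(k, n)               # i = floor(k / n), j = k mod n
    assert flat[k] == A[i, j]
    print(k, "->", (i, j))
```

The same pair `(i, j)` is what a C++ reader would use to address element `k` of the flat buffer as `buffer[i * n + j]`.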

If we want a column-major ordered matrix on the C++ side, the data must already be laid out in column-major order when it is written, e.g. by writing the transpose `A.T`, since `.tofile` always writes in C order.  
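One way to get a column-major file, sketched here under the assumption that writing the transpose is acceptable, relies on `tofile` always writing in C order. The file name `A_colmajor.bin` is hypothetical.

```python
import os
import tempfile

import numpy as np

m, n = 5, 4
A = np.arange(m * n, dtype=np.float32).reshape((m, n))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "A_colmajor.bin")  # hypothetical file name

    # A.T written in C order puts the elements of A on disk column by column
    A.T.tofile(path)
    flat = np.fromfile(path, dtype=np.float32)

print(flat[:m])                            # first column of A
A_back = flat.reshape((m, n), order="F")   # interpret the buffer as column-major
print(np.array_equal(A_back, A))
```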

### `std::ifstream`   
```  
typedef basic_ifstream<char> ifstream;  
```  

### [`reinterpret_cast`](http://en.cppreference.com/w/cpp/language/reinterpret_cast)  
```  
reinterpret_cast < new_type > ( expression )  
```  
Returns a value of type new_type.  
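The notebook does not show how `reinterpret_cast` is used in `binIO_playground.cu`. A common pattern, assumed here, is to hand a `float` buffer to `.read(...)` as a `char*`, so the bytes are taken as they are, with no conversion. The Python sketch below is only an analogy for that reinterpretation, not the C++ code itself.

```python
import struct

import numpy as np

a = np.array([1.0, 2.5, -3.0], dtype=np.float32)
raw = a.tobytes()  # the bytes that tofile would put on disk

# Reinterpret the same bytes as float32, without any conversion
as_floats = np.frombuffer(raw, dtype=np.float32)

# Standard-library equivalent: "=3f" means three native-order 4-byte floats
as_tuple = struct.unpack("=3f", raw)

# Reinterpreting the identical bytes as 32-bit integers exposes the bit patterns
as_ints = np.frombuffer(raw, dtype=np.uint32)

print(as_floats)
print(as_tuple)
print(hex(int(as_ints[0])))  # IEEE 754 bit pattern of 1.0f
```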

### `std::istream_iterator`  

Note, cf. https://stackoverflow.com/questions/37588569/using-stdistream-iterator-to-read-binary-data-from-file-stops-prematuraly

"`istream_iterator` is an avatar of operator `>>`; it uses that operator to read from the stream. That is almost never what you want for reading binary data, because `>>` is a formatted input function. You could probably coerce it to do what you want by using manipulators such as `noskipws` on the stream, but it would still effectively remain a use of the wrong tool for the job.

If you want an iterator-based access to binary data in a stream, you might be better off using an `istreambuf_iterator` (which is guaranteed to work character by character) instead." (Angew)

`std::istreambuf_iterator` is a single-pass input iterator that reads successive characters from the `std::basic_streambuf` object for which it was constructed.  

# Sample Datasets

http://www.stat.ufl.edu/~winner/datasets.html

# `.csv` to Pandas (Python), `.csv` to C++

In [13]:
import pandas
import pandas as pd

## Comma separated, `,`  

### has a header
cf. [Viscosity of Polyacrylamide Copolymers by Concentration and Shear Rate](http://www.stat.ufl.edu/~winner/data/copolymer_viscosity.csv)

In [14]:
copoly_v_DF = pd.read_csv(datafilefolder + "copolymer_viscosity.csv")

In [16]:
copoly_v_DF.describe()

| | SampleID | PolymerConc | ShearRate | Viscosity |
|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 13.0 | 2420.0 | 291.72536 | 11.424 |
| std | 7.359801 | 1781.151313 | 377.970877 | 12.765131 |
| min | 1.0 | 100.0 | 10.0 | 1.1 |
| 25% | 7.0 | 1000.0 | 31.6228 | 2.7 |
| 50% | 13.0 | 2500.0 | 100.776 | 7.6 |
| 75% | 19.0 | 3500.0 | 316.228 | 13.7 |
| max | 25.0 | 5000.0 | 1000.0 | 53.5 |


In [17]:
copoly_v_DF.head()

| | SampleID | PolymerConc | ShearRate | Viscosity |
|---|---|---|---|---|
| 0 | 1 | 100 | 10.0 | 1.4 |
| 1 | 2 | 100 | 31.6228 | 1.3 |
| 2 | 3 | 100 | 100.776 | 1.3 |
| 3 | 4 | 100 | 316.228 | 1.1 |
| 4 | 5 | 100 | 1000.0 | 1.1 |


## Whitespace separated, `       ` or ` `  

### no header

cf. [Manufacturing Learning Curves](http://www.stat.ufl.edu/~winner/data/manuf_learn.dat)

In [22]:
# sep=r"\s+" splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
Manu_learn = pd.read_csv(datafilefolder + "manuf_learn.dat", header=None, sep=r"\s+")

In [23]:
Manu_learn

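| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 120 | 10 | 11 | 4.78749 | 2.30259 | 2.3979 |
| 1 | 1 | 2 | 140 | 10 | 8 | 4.94164 | 2.30259 | 2.07944 |
| 2 | 2 | 3 | 95 | 20 | 54 | 4.55388 | 2.99573 | 3.98898 |
| 3 | 2 | 4 | 125 | 20 | 25 | 4.82831 | 2.99573 | 3.21888 |
| 4 | 3 | 5 | 80 | 40 | 100 | 4.38203 | 3.68888 | 4.60517 |
| 5 | 3 | 6 | 75 | 40 | 80 | 4.31749 | 3.68888 | 4.38203 |
| 6 | 4 | 7 | 65 | 80 | 220 | 4.17439 | 4.38203 | 5.39363 |
| 7 | 4 | 8 | 50 | 80 | 150 | 3.91202 | 4.38203 | 5.01064 |
| 8 | 5 | 9 | 55 | 160 | 410 | 4.00733 | 5.07517 | 6.01616 |
| 9 | 5 | 10 | 40 | 160 | 500 | 3.68888 | 5.07517 | 6.21461 |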


So one possibility is to read the raw file with Pandas, convert the DataFrame to a NumPy array, and then write that array to a binary file with NumPy's `.tofile`.  

One scenario is that C++ takes in only binary files, HDF5 files, and "strictly" comma-separated `.csv` files.  For everything else, especially tab-separated or whitespace-separated files, Pandas can do the preprocessing and then output to binary.  
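The preprocessing route can be sketched end to end as follows. The text in `raw_text` is made-up data standing in for a whitespace-separated file, and `table.bin` is a hypothetical output name; only the sequence of steps is the point.

```python
import io
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up whitespace-separated text standing in for a file such as manuf_learn.dat
raw_text = """\
1   1  120  10   11
1   2  140  10    8
2   3   95  20   54
"""

# Step 1: let Pandas handle the irregular whitespace
df = pd.read_csv(io.StringIO(raw_text), header=None, sep=r"\s+")

# Step 2: convert to a NumPy array with an explicit dtype
values = df.values.astype(np.float32)

# Step 3: write raw bytes, then read them back to check the round trip
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "table.bin")  # hypothetical file name
    values.tofile(path)
    back = np.fromfile(path, dtype=np.float32).reshape(values.shape)

print(values.shape)
print(np.array_equal(back, values))
```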

Remember that C++ is statically typed whereas Python is dynamically typed, and that a raw binary file carries no type information; so be sure to specify the dtype (usually `float32`) in NumPy, so that the C++ side knows how to interpret the bytes it reads.  
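A short sketch of what goes wrong when the dtype is left implicit: NumPy defaults to `float64`, so the same three numbers occupy twice as many bytes, and reading those bytes as `float32` does not raise an error but silently yields wrong values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # NumPy defaults to float64
raw64 = x.tobytes()
raw32 = x.astype(np.float32).tobytes()

print(len(raw64), len(raw32))            # 24 12

# Interpreting float64 bytes as float32 "works", but the data is garbage
wrong = np.frombuffer(raw64, dtype=np.float32)
print(wrong.shape)                       # (6,)
print(np.array_equal(wrong[:3], x))      # False

# With matching dtypes on both sides the values survive unchanged
right = np.frombuffer(raw32, dtype=np.float32)
print(np.array_equal(right, x))          # True
```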

In [32]:
Manu_learn.values.astype(np.float32).shape

(20, 8)

In [33]:
# Make sure the output folder exists, then write the float32 values as raw bytes
if not os.path.exists(datafilefolder):
    os.makedirs(datafilefolder)
Manu_learn.values.astype(np.float32).tofile(datafilefolder + "manuf_learn.npy")

In [34]:
manuf_learn_in = np.fromfile(datafilefolder+ "manuf_learn.npy", dtype=np.float32)

In [36]:
manuf_learn_in.shape

(160,)

In [37]:
manuf_learn_in = manuf_learn_in.reshape((20,8))

In [39]:
manuf_learn_in[:3,:]

array([[   1.        ,    1.        ,  120.        ,   10.        ,
          11.        ,    4.78748989,    2.30258989,    2.3979001 ],
       [   1.        ,    2.        ,  140.        ,   10.        ,
           8.        ,    4.9416399 ,    2.30258989,    2.07944012],
       [   2.        ,    3.        ,   95.        ,   20.        ,
          54.        ,    4.55388021,    2.99572992,    3.98898005]], dtype=float32)
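Since the raw file stores no shape, the `(20, 8)` used above has to be known in advance. If only the number of columns is known, a sketch like the following lets NumPy infer the row count; the array `flat` here is a stand-in for the result of `np.fromfile`, not the real data.

```python
import numpy as np

ncols = 8
flat = np.arange(20 * ncols, dtype=np.float32)  # stand-in for np.fromfile(...)

# Guard against a file whose length does not fit the assumed column count
if flat.size % ncols != 0:
    raise ValueError("element count is not a multiple of the column count")

data = flat.reshape((-1, ncols))  # -1 tells NumPy to infer the number of rows
print(data.shape)                 # (20, 8)
```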