# Introduction to Python for astronomy

Gilles Landais : gilles.landais@unistra.fr

## Lesson 3

Plan:
1. Reading/writing files (Second part)
buffering, binary files, parsing
2. Interact with an external program: communication with pipe
3. The regular expressions
4. The Numpy library (part 1)


## 1. Reading/writing files (second part)

We have seen in lesson 2 how to open, write and update files in Python. 

### Buffering
Buffering controls how often the program actually exchanges data with the filesystem. 

Improving read/write efficiency requires fewer filesystem accesses. 
These operations are slow: they need physical access to the disk, which on a hard drive is limited by the rotation speed.
Buffering reduces the number of input/output operations by accumulating data in memory and flushing it to disk in larger chunks.

The *open* function includes a parameter to manage buffering:
<pre>fd = open(filename, mode, buffering)</pre>

Buffering value | Description
---------------|--------------
0 | no buffering (allowed in binary mode only)
1 | line buffering (text mode only)
-1| default buffering, driven by the OS configuration
*size* (> 1) | buffer size in bytes

- When to use buffering?

&rarr; to improve the efficiency of reading/writing: 
a larger buffer means fewer filesystem accesses, which generally makes reads and writes faster

- When **not** to use buffering?

&rarr; to make input/output transactions safer and avoid possible data loss 
(for instance when a program is killed while it is writing to a file: data still in the buffer is lost)
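
A minimal sketch of the effect of the *buffering* argument. The file name `buffered.txt` and the 64 KiB buffer size are arbitrary choices for the illustration. Data written through a large buffer should only reach the disk once the buffer is flushed (or the file is closed).

```python
import os
import tempfile

# hypothetical file used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "buffered.txt")

# open in text mode with a 64 KiB buffer (arbitrary size)
fd = open(path, "w", buffering=64 * 1024)
fd.write("abc")

# the 3 characters are still held in memory at this point
size_before_flush = os.path.getsize(path)

# flush() forces the buffered data to be written to the filesystem
fd.flush()
size_after_flush = os.path.getsize(path)
fd.close()

print(size_before_flush, size_after_flush)
```

If the program were killed between the `write` and the `flush`, the buffered characters would be lost, which is the risk described above.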

### Writing in binary
- Binary writing reduces disk usage: 
    binary files are smaller, so they take less space in the filesystem 
    and also require less input/output when they are read or written.

    Example: n=12345 takes 5 bytes when stored as a string, but only 4 as a 32-bit integer (int32) and 2 as a 16-bit integer (int16)  
    
- Binary writing avoids type conversions (string &rarr; int, ...)
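
The sizes quoted above can be checked with the *struct* module. This is a small sketch; the explicit little-endian prefix `<` in the format strings is an assumption made here so that the sizes do not depend on the platform.

```python
import struct

n = 12345

as_text = str(n).encode("ascii")   # one byte per digit
as_int32 = struct.pack("<i", n)    # 4-byte signed integer, little-endian
as_int16 = struct.pack("<h", n)    # 2-byte signed integer, little-endian

print(len(as_text), len(as_int32), len(as_int16))

# unpacking gives the integer back directly, with no string -> int conversion
restored = struct.unpack("<h", as_int16)[0]
print(restored)
```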

In [2]:
import os
import struct

def writeBin(filename):
    # open file in write binary mode
    fd = open(filename, "wb")
    
    # write string
    # NOTE: in Python 3, text must be converted to bytes before it is written to a binary file
    fd.write(bytes("ceci est un test binaire\n", "UTF-8")) 
    
    # write integer
    i = 123456
    fd.write(struct.pack("i", i))
    
    # write float
    f = 3.14
    fd.write(struct.pack("f", f))
    
    # close file
    fd.close()
    
def readBin(filename):
    # open file in read binary mode
    fd = open(filename, "rb")
    
    # read a string
    # decode function is required to convert bytes into string (Python3)
    s = fd.readline().decode("utf-8") 
    
    # read integer
    i = fd.read(4)
    i = struct.unpack("i", i)[0]
    
    # read float
    f = fd.read(4)
    f = struct.unpack("f", f)[0]
    
    print ("READ: {:s}, int={:d}, float={:f}".format(s, i, f))
    fd.close()
    
writeBin("test.dat")
readBin("test.dat")

READ: ceci est un test binaire
, int=123456, float=3.140000


### Data serialisation
*serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection link) and reconstructed later (possibly in a different computer environment)* (Wikipedia)

Python provides the **pickle** module, which can write the serialised data either in ASCII (protocol 0) or in binary.

In [1]:
import pickle
data = [{'ra': 0.000899, 'de':1.089009, 'hip': 1},
        {'ra': 0.004265, 'de':-19.498840, 'hip': 2},
        {'ra': 0.005024, 'de': 38.859279,'hip': 3}]

# serialise data into a file data.bck (in ascii)
output = open('data.bck', 'wb')
pickle.dump(data, output, 0)
output.close()

In [5]:
# serialise data into a file databin.bck (in binary)
output = open('databin.bck', 'wb')
pickle.dump(data, output, -1)
output.close()

In [7]:
# list created files
for f in os.listdir():
    if f.endswith(".bck"): print (f)

databin.bck
data.bck


In [8]:
# open the serialized file
# (the handle is not named "input", which would hide the built-in input() function)
with open('databin.bck', 'rb') as infile:
    datain = pickle.load(infile)
print (datain)

[{'hip': 1, 'ra': 0.000899, 'de': 1.089009}, {'hip': 2, 'ra': 0.004265, 'de': -19.49884}, {'hip': 3, 'ra': 0.005024, 'de': 38.859279}]


### Moving within a file 
Reading a file is a sequential process: the current position in the file is tracked by a *cursor* 
(a pointer into the open file). 

The *cursor* can be moved within the file:
    
- *tell()*: method which returns the current position of the cursor
- *seek(n)*: method which moves the cursor to position *n* (counted from the beginning of the file)

In [9]:
# read the file asu.ascii, a fixed-width (column-aligned) ASCII file
# table coming from II/220 VizieR catalog : Polarisation of Be stars (McDavid, 1986-1999)
with open("asu.ascii") as fd:
    print (fd.read())



s|2H  Cam| 21291|1035|4.23|03 29 04.1 +59 56 25|B9Ia           |000
s|omi Sco|147084|6081|4.55|16 20 38.1 -24 10 08|A5II           |000
p|gam Cas|  5394| 264|2.47|00 56 42.2 +60 43 01|B0.5IVe        |230
p|phi Per| 10516| 496|4.07|01 43 39.4 +50 41 20|B1.5(V:)e-shell|400
p|48  Per| 25940|1273|4.04|04 08 39.5 +47 42 47|B4Ve           |200
p|zet Tau| 37202|1910|3.00|05 37 38.6 +21 08 34|B1IVe-shell    |220
p|48  Lib|142983|5941|4.88|15 58 11.3 -14 16 45|B3:IV:e-shell  |400
p|chi Oph|148184|6118|4.42|16 27 01.3 -18 27 21|B1.5Ve         |140
p|pi  Aqr|212571|8539|4.66|22 25 16.5 +01 22 38|B1III-IVe      |300
p|omi And|217675|8762|3.62|23 01 55.1 +42 19 34|B6III          |260



In [10]:
# go to the 3rd line : a line has 67 characters  (+1 '\n')
#(see ReadMe description ftp://cdsarc.u-strasbg.fr/pub/cats/II/220/ReadMe)
with open("asu.ascii") as fd:
    fd.seek((67+1)*2) 
    line = fd.readline()
    print ("record3: "+line)
    print ("tell={} (=68*3)".format(fd.tell()))

record3: p|gam Cas|  5394| 264|2.47|00 56 42.2 +60 43 01|B0.5IVe        |230

tell=204 (=68*3)
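
The same idea can be sketched on a self-contained example. The file `fixed.dat`, its four records and the helper `read_record` are hypothetical and only serve the illustration. The file is opened in binary mode here so that *seek* and *tell* count plain bytes.

```python
import os
import tempfile

# hypothetical fixed-width table: every record has 7 characters + 1 newline
RECORD_LEN = 7 + 1
records = ["rec0001", "rec0002", "rec0003", "rec0004"]

path = os.path.join(tempfile.mkdtemp(), "fixed.dat")
with open(path, "wb") as fd:
    for rec in records:
        fd.write((rec + "\n").encode("ascii"))

def read_record(path, index):
    """Return record number `index` (0-based) and the cursor position after reading."""
    with open(path, "rb") as fd:
        fd.seek(index * RECORD_LEN)   # jump directly to the start of the record
        raw = fd.read(RECORD_LEN)     # read exactly one record
        return raw.decode("ascii").rstrip("\n"), fd.tell()

record, position = read_record(path, 2)
print(record, position)
```

Because every record has the same length, the third record is reached without reading the first two.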


-------------------------
## 2. Interact with an external program: communication with pipe

### Execute a command in a python program


In [11]:
import os
os.system("sleep 5")

0

The code above runs the Unix command "sleep 5" in a subshell. 
The call is synchronous: the next Python instruction is executed only when the Unix command has finished. 
The returned value (0 here) reports how the command ended; 0 means success.

**Note**: There is no interaction between the Python code and the executed process!

### Interact with an external program
In Unix, a *pipe* allows processes to communicate with each other:

Example:<pre> cat /var/log/syslog | grep error</pre>

Using a pipe is simple in Python: the API is similar to opening a file.

In [12]:
import os
with os.popen("ls -1") as fd:
    for line in fd:
        print ("File:",line)

File: asu.ascii

File: asu.tsv

File: bibcat

File: bibcat.ori

File: binaryfile.py

File: cours3.new.pdf

File: cours3.odp

File: cours3.pdf

File: cours-python3.ipynb

File: data.bck

File: databin.bck

File: exemple_numpy.py

File: expre.py

File: fichier.py

File: hipparcos.tsv

File: hipparcos.txt

File: hip.tsv

File: initnumpy.py

File: initnumpy.pyc

File: matrice.py

File: np

File: numpy1.py

File: openfileTell.py

File: readinput.py

File: simplenumpy.py

File: slice.py

File: TD_1_prefilled.ipynb

File: td3_1bis.py

File: td3_1.py

File: td3_2bis.py

File: td3_2.py

File: TD3.ipynb

File: td3.txt

File: test.dat

File: testficbin.py

File: testpopen.py

File: test.py

File: t.py

File: tt.py

File: ttt.py

File: vizier.u-strasbg.fr



- The *pipe* descriptor enables communication with the started process.
- Pipe descriptors are available for STDIN, STDOUT and STDERR

In [14]:
import sys
import subprocess

def read_pipe(command):
    p = subprocess.Popen(["/bin/sh"], 
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE,
                     stdin=subprocess.PIPE)

    p.stdin.write(bytes(command, "UTF-8"))
    p.stdin.close()

    for line in p.stdout:
        print (line.decode("UTF-8").strip())
    p.stdout.close()

    for line in p.stderr:
        sys.stderr.write("(error) {}\n".format(line.decode("UTF-8").strip()))
    p.stderr.close()


In [15]:
read_pipe("ls -a")

.
..
asu.ascii
asu.tsv
bibcat
bibcat.ori
binaryfile.py
cours3.new.pdf
cours3.odp
cours3.pdf
cours-python3.ipynb
data.bck
databin.bck
exemple_numpy.py
expre.py
fichier.py
hipparcos.tsv
hipparcos.txt
hip.tsv
initnumpy.py
initnumpy.pyc
.ipynb_checkpoints
matrice.py
np
numpy1.py
openfileTell.py
readinput.py
simplenumpy.py
slice.py
TD_1_prefilled.ipynb
td3_1bis.py
td3_1.py
td3_2bis.py
td3_2.py
.td3_2.py.swo
TD3.ipynb
td3.txt
test.dat
testficbin.py
testpopen.py
test.py
t.py
tt.py
ttt.py
vizier.u-strasbg.fr


In [16]:
# execute a command that generates error
read_pipe("ls inexisting_file")

(error) ls: impossible d'accéder à inexisting_file: Aucun fichier ou dossier de ce type
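
As an alternative sketch (not the approach used in `read_pipe` above), *subprocess.run* can collect STDOUT, STDERR and the exit code in one call. The helper name `run_command` is hypothetical, the `capture_output` argument requires Python 3.7 or later, and the Python interpreter itself is used as the external program so that the example does not depend on Unix commands.

```python
import subprocess
import sys

def run_command(args):
    """Run a command, wait for it, and return (exit code, stdout, stderr)."""
    done = subprocess.run(args, capture_output=True, text=True)
    return done.returncode, done.stdout, done.stderr

# a command that succeeds and writes to STDOUT
code, out, err = run_command([sys.executable, "-c", "print('hello')"])
print(code, out.strip())

# a command that fails and writes to STDERR
bad_code, bad_out, bad_err = run_command(
    [sys.executable, "-c", "import sys; sys.exit('failed')"])
print(bad_code, bad_err.strip())
```

Passing the command as a list of arguments avoids starting a shell, and reading both streams in one call avoids blocking when one of them fills up.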


-----------------------------
## 3. The regular expressions

Regular expressions make it possible to find a pattern in a string.

Regular expressions are an extension of the *wildcards* used in UNIX.

Example: wildcard usage in the *ls* command
<pre> ls a*.py</pre>

In [17]:
import sys
def search_file_word(filename, word):
    """ A simple python code to search a word into a file
        filename: file name
        word: the word to find
    """
    with open(filename,'r') as fd:
        for line in fd:
            if line.find(word)>=0:
                print (line)

Regular expressions extend what wildcards can do:
- character classes: alphanumeric characters, punctuation, special characters...
- repetition specification
- extraction of substrings (for instance to update them)

The regular expression syntax is defined independently of Python. It was notably extended by the **Perl** language.

Examples of languages that support regular expressions:
- Perl
- Awk
- Python (import re)
- Java (import java.util.regex)
- ...


#### The most common regular expressions:

Reg exp | Description
--------|-------------
.       | any character (except a newline)
^       | beginning of the line
$       | end of the line
[a-zA-Z0-9] | any character listed in the []
[^\t\n]     | any character which is **not** in the []
\d          | a digit
\w          | an alphanumeric character (or underscore)
\s          | whitespace: space, tab or newline (\n)

#### Repetition instructions

Reg exp | Description
--------|-------------
\*       | Repetition (0 or more) of the previous character 
\+       | Repetition (1 or more) of the previous character
?       | 0 or 1 occurrence of the previous character
{n}     | previous character repeated exactly n times

**Examples**:
- A file name including its extension:    ^[^\.]\*\\..\*\$ 
- A date with format YYYY-MMM-DD: ^\d{4}-\w+-\d{2}\$
- An e-mail address (lastname.firstname@provider.country): ^[^\.]+\\.[^@]\*@[^\.]+\\.\w+\$
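
These three patterns can be tried with the *re* module. This is a sketch: the variable names and the test strings are made up for the illustration, and the mail pattern is written here with an explicit `\.` before the final `\w+` so that the dot in front of the country code is matched.

```python
import re

# raw strings (r"...") keep the backslashes intact for the re module
file_re = re.compile(r"^[^.]*\..*$")              # name, a dot, then the extension
date_re = re.compile(r"^\d{4}-\w+-\d{2}$")        # YYYY-MMM-DD
mail_re = re.compile(r"^[^.]+\.[^@]*@[^.]+\.\w+$")

file_ok = file_re.search("expre.py") is not None
file_ko = file_re.search("README") is not None    # no dot: no extension

date_ok = date_re.search("2012-sept-01") is not None
date_ko = date_re.search("2012-sept-1") is not None   # the day needs 2 digits

mail_ok = mail_re.search("lastname.firstname@provider.fr") is not None
mail_ko = mail_re.search("lastname.firstname@provider") is not None

print(file_ok, file_ko, date_ok, date_ko, mail_ok, mail_ko)
```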

#### Select a substring in a regular expression
Example: a date with format YYYY-MMM-DD, e.g. 2012-sept-01

^(\d{4})-(\w+)-(\d{2})\$

The syntax creates 3 groups delimited by parentheses ()
- group 1: \d{4}  &rarr; 2012
- group 2: \w+  &rarr; sept
- group 3: \d{2} &rarr; 01

#### Regular expression in Python
- import *re* module
- syntax close to the one used in Java

In [18]:
import re

#s = input("date (format YY/MM/DD HH:mm:ss) ?")
s = "18/01/01 00:10:12"

# raw string (r"...") so that the backslashes reach the re module unchanged
if re.search(r"^\d{2}/[01]\d/[0123]\d +[012]\d:[0-5]\d:[0-5]\d$",s):
    print ("ok")

ok


When a regular expression is used several times, it is more efficient to compile it once and reuse the compiled object:
<pre> re.compile(reg_exp)</pre>

In [19]:
s = "18/01/01 00:10:12"

reg = re.compile(r"^\d{2}/[01]\d/[0123]\d +[012]\d:[0-5]\d:[0-5]\d$")
if reg.search(s):
    print ("ok")

ok


#### Select a substring with a regular expression

In [20]:
s = "18/01/01 00:10:12"
reg = re.compile(r"^(\d{2})/([01]\d)/([0123]\d) +([012]\d):([0-5]\d):([0-5]\d)$")
mo = reg.search(s)
if mo:
    print ("ok:"+mo.group(1)+"-"+mo.group(2)+"-"+mo.group(3))

ok:18-01-01
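
Groups can also be given names with the `(?P<name>...)` syntax of the *re* module, which avoids counting parentheses. A short sketch on the same date string; the group names are chosen here for the illustration.

```python
import re

date_re = re.compile(
    r"^(?P<year>\d{2})/(?P<month>[01]\d)/(?P<day>[0123]\d) +"
    r"(?P<hour>[012]\d):(?P<minute>[0-5]\d):(?P<second>[0-5]\d)$"
)

mo = date_re.search("18/01/01 00:10:12")
if mo:
    all_groups = mo.groups()    # every group, in order, as a tuple
    by_name = mo.groupdict()    # the same values indexed by group name
    print(all_groups)
    print(by_name["hour"], by_name["minute"], by_name["second"])
```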


-------------------------
## (break) Exercises (TD)

## TD1: find positions in the VizieR catalogue J/ApJ/700/1299 for objects named after the Henry Draper catalogue (Name = HD + number)

1. Execute the following Unix command:

wget -O - "http://vizier.u-strasbg.fr/viz-bin/asu-tsv?-source=J/ApJ/700/1299/table2&-out.add=_RAJ,_DEJ&-out=Name&-out.max=1000" |egrep  '^ *[0-9].*'

2. Store the data in a suitable structure indexed by the HD number (a dictionary) 

3. Write code which returns the position for a given HD number
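
One possible starting point for steps 2 and 3 is sketched below. It does not query VizieR: the three sample lines are invented, and the column order (RA, Dec, Name separated by tabs) is an assumption about the output of the command above. The names `index_by_hd` and `hd_re` are hypothetical.

```python
import re

# invented sample lines: RA, Dec and Name separated by tabs (assumed layout)
sample_lines = [
    "010.5000\t+20.2500\tHD 1234",
    "123.7500\t-05.1250\tHD 98765",
    "200.0000\t+45.0000\tGJ 581",
]

# keep only names of the form "HD" + number, and capture the number
hd_re = re.compile(r"^HD\s*(\d+)$")

def index_by_hd(lines):
    """Build a dictionary {HD number: (ra, dec)} from tab-separated lines."""
    positions = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        mo = hd_re.search(fields[2].strip())
        if mo:
            positions[int(mo.group(1))] = (float(fields[0]), float(fields[1]))
    return positions

positions = index_by_hd(sample_lines)
print(positions.get(1234))
```

With real data, the lines would come from the pipe opened on the *wget* command instead of `sample_lines`.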


-------------------------------
## 4. The Numpy library (part 1)
![Image of Numpy](https://scipy.org/_static/images/numpylogo_med.png)
(http://numpy.scipy.org/, http://www.scipy.org/)

- library to work with vectors and arrays
- numpy is written mostly in C and used from Python as a module
- needs installation:
    - from sources: 
    <pre>python setup.py install</pre>
    - (Ubuntu/Debian) using apt-get (package for Python 3): 
    <pre>apt-get install python3-numpy</pre>
    - with pip: 
    <pre>pip install numpy</pre>

#### A first example : make a vector addition

In [22]:
# python code without numpy
def sum_vecteur(a,b):
    c = []
    for i in range(len(a)):
        c.append(a[i]+b[i])
    return c

a = (1.,2.3,4.5)
b = (-2.3,4.5,9.9)
print (sum_vecteur(a,b))

[-1.2999999999999998, 6.8, 14.4]


In [23]:
# code using numpy
import numpy

a = (1.,2.3,4.5)
b = (-2.3,4.5,9.9)

na = numpy.array(a)
nb = numpy.array(b)
print (na+nb)

[ -1.3   6.8  14.4]


### Numpy datatype

Data type | Description 
----------|----------
bool | boolean
int | integer (int32 or int64, depending on the platform)
int8, int16, int32, int64 | signed integer of fixed size
uint8, uint16,... | unsigned integer
float | real (float64)
float16, float32, float64 | real of fixed size
complex | complex number (complex128), e.g.: 1+1j
complex64, complex128 | complex number of fixed size

### Declare and initialize Numpy structures

Type | Description
-----| ----------
numpy.array | function which builds an array (vector), e.g.: numpy.array([1,2,3])
numpy.ndarray | the multi-dimensional array type itself
numpy.matrix | matrix (2 dimensions only), e.g.: numpy.matrix([[1,0],[0,1]])

**Example with numpy.ndarray:**


In [24]:
n = numpy.ndarray(shape=(2,2), dtype='int',buffer=numpy.array([[1,0],[1,2]]))
print (n)

[[1 0]
 [1 2]]


### Numpy initialisation

In [2]:
import numpy as np

na = np.arange(10)
print (na)

c = (1,2,3)
nc = np.array(c)
print (nc)

na = np.arange(0,0.5,0.1)
print (na)

[0 1 2 3 4 5 6 7 8 9]
[1 2 3]
[ 0.   0.1  0.2  0.3  0.4]
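
A few other initialisation functions, as a short sketch. The NumPy documentation advises *numpy.linspace* rather than *numpy.arange* when the step is not an integer, because rounding can change the number of elements returned by *arange*.

```python
import numpy as np

zeros = np.zeros(3)             # 3 elements, all equal to 0.0
ones = np.ones((2, 2))          # 2x2 array filled with 1.0
grid = np.linspace(0, 0.4, 5)   # 5 values from 0 to 0.4, both ends included

print(zeros)
print(ones)
print(grid)
```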


### Mathematical operations

- operations on vectors and matrices
- addition, scalar multiplication, matrix multiplication
- trigonometric functions and the usual mathematical functions:

In [26]:
a= np.array((1,2.2,4))
print (3*a)

[  3.    6.6  12. ]
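
The other operations listed above work the same way, element by element. A minimal sketch with made-up vectors:

```python
import numpy as np

angles = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(angles)                       # element-wise sine
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))   # element-wise square root

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
total = u + v                    # element-wise addition
scalar_product = np.dot(u, v)    # 1*4 + 2*5 + 3*6

print(sines)
print(roots)
print(total, scalar_product)
```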


**Exercise:**
- create a numpy array
- multiply the vector by a scalar

**Matrix example:**
- build a 3x3 matrix: m = numpy.matrix([[2.1,2.2,3.1],[-1,0.5,-2],[1.1,0,-1.2]])
- compute the inverse matrix (*m.getI()*) and the transposed matrix (*m.getT()*)
- compute the product of the matrix with its inverse, and look at the result

In [27]:
m = numpy.matrix([[2.1,2.2,3.1],[-1,0.5,-2],[1.1,0,-1.2]])
print (m)

[[ 2.1  2.2  3.1]
 [-1.   0.5 -2. ]
 [ 1.1  0.  -1.2]]


In [28]:
mm=m.getI()
print (mm)

[[ 0.05744375 -0.25275251  0.56965055]
 [ 0.3255146   0.56773576 -0.10531355]
 [ 0.05265677 -0.2316898  -0.31115366]]


In [29]:
print (mm*m)

[[  1.00000000e+00   0.00000000e+00   2.52975663e-17]
 [ -2.41482808e-18   1.00000000e+00  -3.24324366e-17]
 [  5.67069396e-18   5.55111512e-17   1.00000000e+00]]
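
The same computation can be sketched with a plain array instead of *numpy.matrix* (the NumPy documentation no longer recommends the *matrix* class). This variant is an alternative to the cells above, not a replacement of them: *numpy.linalg.inv* computes the inverse and the `@` operator performs the matrix product.

```python
import numpy as np

m = np.array([[2.1, 2.2, 3.1], [-1, 0.5, -2], [1.1, 0, -1.2]])

m_inv = np.linalg.inv(m)   # inverse matrix
m_t = m.T                  # transposed matrix
identity = m_inv @ m       # matrix product (not element-wise)

# the product equals the identity matrix up to rounding errors
print(np.allclose(identity, np.eye(3)))
```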


### Extract a sub part of a numpy array (slicing)

Unlike slicing a Python list (e.g.: tab[n:m]), which returns a copy, slicing a numpy array does **not make a copy**: the result is a view, and modifying it also modifies the original array.

In [30]:
a = np.arange(10)
b = a[1 :4]
b[0] = -1
print (a)

[ 0 -1  2  3  4  5  6  7  8  9]


In [11]:
# initialise a 3x3 numpy array with zeros
a = np.array([(0,0,0),(0,0,0),(0,0,0)])
b = a[1:3,1:3]
print (b)

b[0,0] = -1
print (a)

[[0 0]
 [0 0]]
[[ 0  0  0]
 [ 0 -1  0]
 [ 0  0  0]]


In [32]:
# Make a numpy array copy

c = b.copy()
c[0,0] = -2
print (c)
print (b)

[[-2  0]
 [ 0  0]]
[[-1  0]
 [ 0  0]]


### Append Numpy array

- the '+' operator cannot be used to concatenate numpy arrays (as it can for Python lists) because for numpy arrays '+' is defined as vector addition.
- The *.append* method is not available 

Adding new elements to a Numpy array is possible with the function: *numpy.append(..)*

**Warning**: *numpy.append* returns a new array (a copy); the original array is left unchanged!

In [33]:
a = np.array([1.1,2.2,3.3])

try:
    a.append([4.4,5.5])
except Exception as err:
    print (err)

b = np.append(a, [4.4,5.5])
print (b)

a[0] -= 1.1
print (a)
print (b)

'numpy.ndarray' object has no attribute 'append'
[ 1.1  2.2  3.3  4.4  5.5]
[ 0.   2.2  3.3]
[ 1.1  2.2  3.3  4.4  5.5]


### Append to a multi-dimensional Numpy array

*concatenate((arr1,arr2), axis=..)* : concatenate 2 numpy arrays along the given axis

*numpy.c_[]* : add a new column

*numpy.r_[]* : add a new row (record)

*reshape(x,y)* : return an array with shape (x,y) holding the original values

see : http://wiki.scipy.org/Tentative_NumPy_Tutorial

In [34]:
a = np.array([(1.1,2.2,3.3),(-1.1,-2.2,-3.3)])
b = np.c_[a,[0,0]]
print (b)

a = np.arange(20)
b = a.reshape(4,5)
print (b)

d = np.array([1,2,3,4]).reshape(4,1)
c = np.concatenate((b,d), axis=1)
print (c)

[[ 1.1  2.2  3.3  0. ]
 [-1.1 -2.2 -3.3  0. ]]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
[[ 0  1  2  3  4  1]
 [ 5  6  7  8  9  2]
 [10 11 12 13 14  3]
 [15 16 17 18 19  4]]
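
Adding a row works in the same way as adding a column. A short sketch with *numpy.r_* and with *concatenate* along axis 0; the appended values are arbitrary.

```python
import numpy as np

a = np.array([(1.1, 2.2, 3.3), (-1.1, -2.2, -3.3)])

# append one row with numpy.r_ (the new row is written as a 1x3 table)
b = np.r_[a, [[0, 0, 0]]]

# same operation with concatenate along axis 0 (the rows)
c = np.concatenate((a, [[9.9, 9.9, 9.9]]), axis=0)

print(b)
print(c)
```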


### Working with Numpy tables

Multi-dimensional arrays are similar to tables with columns and records (or rows). 
Numpy makes it possible to give a name and a type to every column.

However, Numpy table manipulation is tedious. We will see in the next session a more user-friendly interface: 
*astropy*

In [35]:
# create a datatype including name+type for a table having 3 columns
dt = np.dtype([('field1','f8'),('field2','f8'),('field3','f8')])

# create a table using the previous datatype definition
a = np.array((1.1,2.2,3.3), dtype=dt)
print (a['field1'])


1.1


In [38]:
# direct initialisation 
npval = np.array([
(950,5.766,5.22),
(951,3.766,7.828),
(952,8.46,8.481)], dtype=[('hip','i8'),('btmag','f8'),('vtmag','f8')])
print (npval['btmag']*2)
print (npval[0])

[ 11.532   7.532  16.92 ]
(950,  5.766,  5.22)


**Note**: a common error is to write a record with square brackets (a list) instead of parentheses (a tuple) when the array uses a structured datatype


In [39]:
try:
    a = np.array([1.1,2.2,3.3], dtype=dt)
except Exception as err:
    print (err)

'float' does not support the buffer interface


#### Initialise an array with zeros:

In [40]:
dt = np.dtype([('ra','f8'),('dec','f8'),('hip','i8')])
data = np.zeros(10, dtype=dt)
print (data)

[( 0.,  0., 0) ( 0.,  0., 0) ( 0.,  0., 0) ( 0.,  0., 0) ( 0.,  0., 0)
 ( 0.,  0., 0) ( 0.,  0., 0) ( 0.,  0., 0) ( 0.,  0., 0) ( 0.,  0., 0)]
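
Once the array is allocated with zeros, it can be filled record by record and then filtered on a column. A minimal sketch; the three records reuse the values of the serialisation example, and each record is assigned as a tuple.

```python
import numpy as np

dt = np.dtype([('ra', 'f8'), ('dec', 'f8'), ('hip', 'i8')])

rows = [(0.000899, 1.089009, 1),
        (0.004265, -19.498840, 2),
        (0.005024, 38.859279, 3)]

# allocate the table, then fill it one record (tuple) at a time
data = np.zeros(len(rows), dtype=dt)
for i, row in enumerate(rows):
    data[i] = row

# select the records with a positive declination
north = data[data['dec'] > 0]
print(north['hip'])
```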


--------------------------------------------
## TD (part 2)

Use Numpy to compute the color of Hipparcos objects.

- Download a subset of the Hipparcos catalogue from VizieR:
wget -O - "http://vizier.u-strasbg.fr/viz-bin/asu-tsv?-source=I/239/hip_main&-oc.form=dec&-out.add=_RAJ,_DEJ&-out=HIP,BTmag,VTmag&-out.max=50&BTmag=>0&VTmag=>0"|egrep  '^ *[0-9].*'>hipparcos.tsv

The result is a file in TSV format.

- create a Numpy array to store the Hipparcos file

    - Give a datatype to the structure (using *numpy.dtype(...)*)
    
- Fill the array with values coming from the TSV file:
    - read the file line by line 
    - split each line to build a Python list of tuples (one tuple per line)
    - fill the numpy array using this Python list

- Compute the color  B-V = BTmag-VTmag
- Compute the angular distance from (0,0)
  $\sqrt{ra^{2}+de^{2}}$
- Print the distances smaller than 45 deg

In [2]:
import os
os.system("wget -O - \"http://vizier.u-strasbg.fr/viz-bin/asu-tsv?-source=I/239/hip_main&-oc.form=dec&-out.add=_RAJ,_DEJ&-out=HIP,BTmag,VTmag&-out.max=50&BTmag=>0&VTmag=>0\"|egrep '^ *[0-9].*'>hipparcos.tsv")

        

256

In [None]:

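import numpy as np

# datatype describing one record of the Hipparcos file
dt = np.dtype([('ra','f8'),('dec','f8'),('hip','i8'),('btmag','f8'),('vtmag','f8')])

with open("hipparcos.tsv", "r") as fd:
    for line in fd:
        # split the record into its tab-separated fields
        rec = line.rstrip("\n").split("\t")
        # (to be completed: convert the fields and store them in the numpy array)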