In [1]:
%autosave 0

Autosave disabled


In [2]:
from IPython.core.display import HTML
css_file = '../../../style/style03my.css'
HTML(open(css_file, "r").read())

>### [Sergio Rojas](http://prof.usb.ve/srojas)<br>
[Departamento de F&iacute;sica](http://www.fis.usb.ve/), [Universidad Sim&oacute;n Bol&iacute;var](http://www.usb.ve/), [Venezuela](http://es.wikipedia.org/wiki/Venezuela)

>#### Content under [Creative Commons Attribution license CC-BY 4.0](http://creativecommons.org/licenses/by/4.0/), [code under MIT license (c)](http://en.wikipedia.org/wiki/MIT_License)2016-2017 Sergio Rojas (srojas@usb.ve).###

# <center> NumPy functionality for <font color=red>reading</font> and  <font color=red>writing</font> data from file</center>

An important issue in scientific computing is having an efficient way to read data from and write data to disk, because such operations, though important, are in general very slow, particularly in this era of big data analysis.

In this regard, NumPy offers simple functionality for reading non-binary (text) data from disk, [genfromtxt()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html) [http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html], and for writing (storing) data to disk, [savetxt()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html) [http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html]. You can browse their documentation to familiarize yourself with the many options they offer to read and write text data.

# Reading data with genfromtxt()

According to the main documentation of [genfromtxt()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html) [http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html], a basic call to this function to read data could involve:

<ul>
<li> 
    The **filename** (parameter `fname`), the only mandatory argument, which contains the full path of the
                   file to be read.
<li>
    The keywords **skip_header** and **skip_footer** to indicate the number of lines to skip
                   at the top or at the bottom of the file.
<li>
The keyword **delimiter** to indicate how the columns of data are
                    separated.
<li>    The keyword **dtype** indicating the data type.
<li>    The keyword **filling_values** indicating the values to use in place of missing data.
<li>    The keyword **usecols** to indicate which columns to read.
<li>    The keyword **names** to give the names of each column
           as a list of strings.
</ul>
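As a minimal sketch of how these keywords combine, the following cell reads a small in-memory table instead of a file on disk. The text content, the column names `x` and `y`, and the fill value `-1.0` are assumptions made up for illustration.

```python
import io
import numpy as np

# Hypothetical in-memory "file": one title line, three data rows
# (the second row has a missing value) and one footer line.
text = u"""title,of,table
1.0,2.0,a
3.0,,b
5.0,6.0,c
end,of,data
"""

data = np.genfromtxt(io.StringIO(text),
                     delimiter=",",        # columns are comma separated
                     skip_header=1,        # ignore the title line
                     skip_footer=1,        # ignore the footer line
                     usecols=(0, 1),       # read only the two numeric columns
                     names=["x", "y"],     # field names for the selected columns
                     dtype=float,          # both selected columns are floats
                     filling_values=-1.0)  # value used where data is missing

print(data["x"])
print(data["y"])
```

Because `names` is given, the result is a structured array whose columns are accessed by field name, and the empty entry in the second row is replaced by the fill value.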

In [3]:
from IPython.display import HTML

# An example

The following example shows how genfromtxt() can be used to read the data from
the **Iris Plants Database**, which can be obtained from 
[https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data) 
and is described at [https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names). 
Both files are shown below.

## The data 

Here we can see that the file contains five comma-separated columns: the first four are of numeric (float) type and the last (fifth) is of string type (describing the class of the data). Exploring the data set is important so that we know its structure before reading it.
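When a local copy of the file is available, a quick look at its first lines is often enough to confirm the delimiter and the number of columns. The sketch below is one possible way to do this. The helper `peek()` and the temporary three-row file are hypothetical stand-ins, not part of NumPy or of the data set itself.

```python
import itertools
import os
import tempfile

def peek(path, n=5):
    """Return the first n lines of a text file, without the trailing newline."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]

# Hypothetical stand-in for a local copy of iris.data: three rows in the same layout
sample = ("5.1,3.5,1.4,0.2,Iris-setosa\n"
          "4.9,3.0,1.4,0.2,Iris-setosa\n"
          "4.7,3.2,1.3,0.2,Iris-setosa\n")
with tempfile.NamedTemporaryFile("w", suffix=".data", delete=False) as tmp:
    tmp.write(sample)

first_lines = peek(tmp.name, n=2)
for line in first_lines:
    # Splitting on the delimiter shows how many columns each line holds
    print(len(line.split(",")), line)

os.remove(tmp.name)
```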

In [4]:
HTML('<iframe src=https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=700 height=50></iframe>')

## The data description 

Reading the description of the data we can see that the numeric data represents
the sepal length, the sepal width, the petal length, and the petal width of the
classes of plants reported in the data set.

In [5]:
HTML('<iframe src=https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names width=700 height=150></iframe>')

Using this information we can build the following function to read this data into a Python session. In this example we are assuming that the data set has been downloaded and stored as **iris.data** in a directory named **Data_set** under the current one. The function reads as follows:

In [6]:
import sys
import numpy as np

def read_iris(datafile):
    # The class labels are read as byte strings under Python 2
    # and as unicode strings under Python 3
    str_type = '|S30' if sys.version_info[0] == 2 else 'U30'
    lines = np.genfromtxt(datafile, delimiter=",",
                          dtype=[('sepal_length', float), ('sepal_width', float),
                                 ('petal_length', float), ('petal_width', float),
                                 ('class', str_type)])
    return lines

In [7]:
import numpy as np
import sys

datafile = './Data_set/iris.data'

thefilecontent = read_iris(datafile)

In [8]:
print(thefilecontent['sepal_length'][0:10])

[ 5.1  4.9  4.7  4.6  5.   5.4  4.6  5.   4.4  4.9]


In [9]:
print(thefilecontent['petal_length'][0:10])

[ 1.4  1.4  1.3  1.5  1.4  1.7  1.4  1.5  1.4  1.5]


In [10]:
print(thefilecontent['class'][0:10])

['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa']
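Since each column is accessed by its field name, the fields can be combined with boolean masks to summarize the data per class. The following is a minimal sketch on a hypothetical four-row miniature of the structured array (built by hand here, with only three of the five fields), not on the full data set.

```python
import numpy as np

# Hypothetical miniature of the structured array returned by read_iris()
sample = np.array([(5.1, 1.4, 'Iris-setosa'),
                   (4.9, 1.4, 'Iris-setosa'),
                   (7.0, 4.7, 'Iris-versicolor'),
                   (6.4, 4.5, 'Iris-versicolor')],
                  dtype=[('sepal_length', float),
                         ('petal_length', float),
                         ('class', 'U30')])

print(sample.dtype.names)  # the field names, in column order

for name in np.unique(sample['class']):
    mask = sample['class'] == name  # True for the rows of this class
    print(name, sample['petal_length'][mask].mean())
```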


# Writing data to a file using savetxt()

According to the main documentation of [savetxt()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html) [http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html], a basic call to this function to write data to a file could involve:

<ul>
<li> 
    The **filename** (parameter `fname`), a mandatory argument which contains the full path of the
                   file to which the data is going to be written.
<li> The **data** (parameter `X`), a mandatory array-like argument which contains the data to 
                    be written to the file.
<li> The keyword **fmt** to indicate the format in which the data is going to be written to the file.
<li>    The keyword **delimiter** indicating the separation character between columns of data.
<li>    The keyword **newline** indicating the new line character.
<li>    The keyword **header** to add a brief description of the data.
</ul>
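A minimal sketch of these keywords in action is shown below. It writes to an in-memory buffer rather than to a file on disk so that the result can be inspected directly. The array values and the header text `x,y` are made up for illustration.

```python
import io
import numpy as np

values = np.array([[1.0, 2.5],
                   [3.0, 4.25]])

buffer = io.StringIO()            # stands in for a file name
np.savetxt(buffer, values,
           fmt="%8.3f",           # the same format is applied to every column
           delimiter=",",         # placed between columns
           newline="\n",          # placed at the end of every row
           header="x,y")          # written first, prefixed with "# "

print(buffer.getvalue())
```

The header line is written with a leading `# ` so that genfromtxt() treats it as a comment when the file is read back.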

Before writing data, it is necessary to organize it in an array-like variable in the way we want the data to appear in the file. 

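In our example, suppose we would like to save, column-wise in a file, the first 9 rows of the variables 'sepal_length', 'petal_length' and 'class'.

The data is organized in the following way (note that every entry of the resulting array is a string, because numeric and string columns are stored together in a single array):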

In [11]:
datatosave = np.transpose(np.array([thefilecontent['sepal_length'][0:9],
              thefilecontent['petal_length'][0:9],
              thefilecontent['class'][0:9]]))
print(datatosave)

[['5.1' '1.4' 'Iris-setosa']
 ['4.9' '1.4' 'Iris-setosa']
 ['4.7' '1.3' 'Iris-setosa']
 ['4.6' '1.5' 'Iris-setosa']
 ['5.0' '1.4' 'Iris-setosa']
 ['5.4' '1.7' 'Iris-setosa']
 ['4.6' '1.4' 'Iris-setosa']
 ['5.0' '1.5' 'Iris-setosa']
 ['4.4' '1.4' 'Iris-setosa']]


In [12]:
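datafile = './Data_set/new_iris.data'

# One format specifier per column; each value is right-aligned in a fixed width
theformat = '%9s %14s %20s'

# Because fmt is a single multi-format string, savetxt ignores the delimiter argument
np.savetxt(datafile, datatosave, delimiter='   ',
           newline='\n', fmt=theformat,
           header=' sepal_length   petal_length     class')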

After executing these instructions, the file 'new_iris.data' will be created (or, if it already exists, overwritten) in the 'Data_set' directory under the current directory of the Jupyter notebook.

The content of the file is displayed using:

In [13]:
!more ./Data_set/new_iris.data

#  sepal_length   petal_length     class
      5.1            1.4          Iris-setosa
      4.9            1.4          Iris-setosa
      4.7            1.3          Iris-setosa
      4.6            1.5          Iris-setosa
      5.0            1.4          Iris-setosa
      5.4            1.7          Iris-setosa
      4.6            1.4          Iris-setosa
      5.0            1.5          Iris-setosa
      4.4            1.4          Iris-setosa
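An alternative that keeps the numeric columns as numbers is to pass a structured array to savetxt() together with one format per field. The sketch below assumes a hypothetical two-row structured array and writes to an in-memory buffer. It is an illustration of the idea, not the approach used in the cells above.

```python
import io
import numpy as np

# Hypothetical two-row structured array with numeric and string fields
sample = np.array([(5.1, 1.4, 'Iris-setosa'),
                   (4.9, 1.4, 'Iris-setosa')],
                  dtype=[('sepal_length', float),
                         ('petal_length', float),
                         ('class', 'U30')])

buffer = io.StringIO()
np.savetxt(buffer, sample,
           fmt=['%9.1f', '%14.1f', '%20s'],  # one format per field
           delimiter=' ',                    # joins the per-field formats
           header=' sepal_length   petal_length     class')

print(buffer.getvalue())
```

With a list of formats the delimiter is used to join the columns, and the floats are formatted by savetxt() itself instead of being converted to strings beforehand.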


### Additional readings

In [14]:
from IPython.display import HTML

In [15]:
HTML('<iframe src=http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html width=700 height=250></iframe>')

In [16]:
HTML('<iframe src=http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html width=700 height=250></iframe>')

In [17]:
HTML('<iframe src=http://docs.scipy.org/doc/numpy/reference/routines.io.html width=700 height=250></iframe>')

