In [None]:
# First let's import some libraries we will use...
import numpy as np
import scipy as sp
import pandas as pd

xyz_path = 'na.xyz'     # File path
nframe = 2000           # Number of frames (or snapshots)
nat = 195               # Number of atoms
a = 12.55               # Cell size

# First approach

Write a function that reads an xyz trajectory file in. We are going to need to be able to separate numbers from atomic symbols; an XYZ trajectory file looks like:

```
nat [unit]
[first frame]
symbol1 x11 y11 z11
symbol2 x21 y21 z21
nat [unit]
[second frame]
symbol1 x12 y12 z12
symbol2 x22 y22 z22
```

Items in [ ] are optional: if the unit is absent, angstroms are assumed, and if there is no comment the comment line is left blank.

Here is an example file parser. All it does is read line by line and return a list of these lines.

In [None]:
def skeleton_naive_xyz_parser(path):
    '''
    Simple xyz parser.
    '''
    # Read in file
    lines = None
    with open(path) as f:    
        lines = f.readlines()
    # Process lines
    # Return processed lines
    return lines

lines = skeleton_naive_xyz_parser(xyz_path)
lines

**CODING TIME: Try to expand the skeleton above to convert the line strings 
into a list of xyz data rows (i.e. convert the strings to floats).**

If you can't figure out an approach, run the cell below, which will print one of many possible ways of 
approaching this problem (*note that you may have to run the cell twice*).

In [None]:
%load -s naive_xyz_parser, snippets/parsing.py

In [None]:
data = naive_xyz_parser(xyz_path)
data
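
For readers working without the snippets file, here is a minimal sketch of one way the conversion could go. It is not necessarily what `snippets/parsing.py` contains: the helper name `sketch_parse_xyz_lines` and the sample data are made up for illustration, and the sketch assumes every frame has exactly one count line followed by exactly one comment line.

```python
def sketch_parse_xyz_lines(lines):
    """Convert raw XYZ lines into [symbol, x, y, z] rows (hypothetical helper)."""
    rows = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():              # tolerate stray blank lines between frames
            i += 1
            continue
        nat = int(lines[i].split()[0])        # atom count; an optional unit may follow it
        start = i + 2                         # skip the count line and the comment line
        for line in lines[start:start + nat]:
            symbol, x, y, z = line.split()[:4]
            rows.append([symbol, float(x), float(y), float(z)])
        i = start + nat                       # jump to the next frame's count line
    return rows


# Made-up two-frame sample; the second frame has a blank comment line.
sample = """2
first frame
Na 0.0 0.0 0.0
O 1.0 1.0 1.0
2

Na 0.1 0.1 0.1
O 1.1 1.1 1.1
""".splitlines()

rows = sketch_parse_xyz_lines(sample)
```

The same function should work on the output of `f.readlines()`, since `split()` discards the trailing newlines.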

# DataFrames

People spend a lot of time reading code, especially their own code.
Let's do two things by using DataFrames: make our code more readable
and *not* reinvent the wheel (i.e. parsers). We have pride in the 
code we write! First, an example of using DataFrames:

In [None]:
np.random.seed(1)    # Call the function; assigning to it would overwrite it without seeding
df = pd.DataFrame(np.random.randint(0, 10, size=(6, 4)), columns=['A', 'B', 'C', 'D'])
df

In [None]:
df += 1
df

In [None]:
df.loc[:, 'A'] = [0, 0, 1, 1, 2, 2]
df

In [None]:
df.groupby('A')[['B', 'C', 'D']].apply(lambda f: f.sum())

# Second approach: pandas.read_csv

Like 99% (my estimate) of all widely established Python packages, pandas is very well 
[documented](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).

Let's use this pandas function to read in our well-structured xyz data. The **names** argument (see function below) allows us to specify the column names,
and the **delim_whitespace** argument means that fields are split on runs of whitespace (tabs or spaces); it is equivalent to `sep='\s+'`, which recent pandas releases recommend instead.
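
As a rough illustration of what these two options do on their own, the sketch below reads a small made-up XYZ string from memory (via `io.StringIO`) instead of from `na.xyz`. It uses `sep=r'\s+'`, the documented equivalent of `delim_whitespace=True`. The sample text and variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up two-frame sample with non-blank comment lines.
sample = """2
frame0
Na 0.0 0.0 0.0
O 1.0 1.0 1.0
2
frame1
Na 0.1 0.1 0.1
O 1.1 1.1 1.1
"""

# Split on whitespace and name the four columns; short lines are padded with NaN.
raw = pd.read_csv(io.StringIO(sample), sep=r'\s+', names=['symbol', 'x', 'y', 'z'])

# The atom-count and comment lines are still present as rows with missing coordinates.
n_missing = int(raw['x'].isna().sum())
print(raw)
```

Note that the count and comment lines end up in the **symbol** column, so they still have to be removed afterwards. Also keep in mind that `read_csv` skips completely blank lines by default, which matters if the comment lines in a file are empty.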

**CODING TIME: Figure out what options we need to correctly parse in the XYZ trajectory data using pandas.read_csv**

In [None]:
def skeleton_pandas_xyz_parser(path):
    '''
    Parses xyz files using pandas read_csv function.
    '''
    # Read from disk
    df = pd.read_csv(path, delim_whitespace=True, names=['symbol', 'x', 'y', 'z'])
    # Remove nats and comments
    # 
    #
    return df

In [None]:
df = skeleton_pandas_xyz_parser(xyz_path)
df.head()

One possible solution (**run this only if you have already finished the above!**):

In [None]:
# %load -s pandas_xyz_parser, snippets/parsing.py
def pandas_xyz_parser(path):
    '''
    Parse xyz files using pandas read_csv.

    Args:
        path (str): XYZ file path

    Returns:
        df (:class:`pandas.DataFrame`): Table of XYZ data
    '''
    df = pd.read_csv(path, delim_whitespace=True, names=['symbol', 'x', 'y', 'z'])    # Read data from disk
    indexes_to_discard = df.loc[df['symbol'].str.isdigit(), 'symbol'].index           # Get indexes of nat lines
    indexes_to_discard = indexes_to_discard.append(indexes_to_discard + 1)            # and comment lines
    df = df.loc[~df.index.isin(indexes_to_discard)].reset_index(drop=True)            # Discard them
    df[['x', 'y', 'z']] = df[['x', 'y', 'z']].astype(float)                           # Convert types (np.float was removed from NumPy)
    return df


In [None]:
df = pandas_xyz_parser(xyz_path)
df.head()

# Testing your functions is key

A couple of quick tests should suffice...

In [None]:
len(df) == nframe * nat

In [None]:
df.dtypes
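
The quick checks above could also be bundled into a single function that fails loudly. The following is only a sketch: `check_xyz_table` is a hypothetical helper, and it is exercised on a small made-up table instead of the real trajectory.

```python
import pandas as pd


def check_xyz_table(df, nframe, nat):
    """Run a few sanity checks on a parsed XYZ table (hypothetical helper)."""
    coords = ['x', 'y', 'z']
    assert len(df) == nframe * nat, 'unexpected number of rows'
    assert not df[coords].isna().any().any(), 'missing coordinates'
    assert all(pd.api.types.is_float_dtype(df[c]) for c in coords), 'coordinates are not floats'
    assert not df['symbol'].str.isdigit().any(), 'atom-count lines were not removed'
    return True


# Made-up table standing in for the parser output: 2 frames of 2 atoms each.
toy = pd.DataFrame({'symbol': ['Na', 'O', 'Na', 'O'],
                    'x': [0.0, 1.0, 0.1, 1.1],
                    'y': [0.0, 1.0, 0.1, 1.1],
                    'z': [0.0, 1.0, 0.1, 1.1]})

ok = check_xyz_table(toy, nframe=2, nat=2)
```

On the real data, the call would be `check_xyz_table(df, nframe, nat)`.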

# Let's attach a useful index (for later)

This is easy since we know the number of atoms and number of frames...

In [None]:
df = pandas_xyz_parser(xyz_path)
df.index = pd.MultiIndex.from_product((range(nframe), range(nat)), names=['frame', 'atom'])
df

In [None]:
# %load -s parse, snippets/parsing.py
def parse(path, nframe, nat):
    '''
    Complete parsing of xyz files.
    '''
    df = pandas_xyz_parser(path)
    df.index = pd.MultiIndex.from_product((range(nframe), range(nat)), names=['frame', 'atom'])
    return df
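
To give a feel for why a (frame, atom) index is useful, here is a small sketch of the kinds of selections it allows. The table is made up (2 frames of 3 atoms) and the variable names are only illustrative.

```python
import pandas as pd

nframe_toy, nat_toy = 2, 3    # Made-up sizes, not the real trajectory
toy = pd.DataFrame({'symbol': ['Na', 'O', 'H'] * nframe_toy,
                    'x': [0.0, 1.0, 2.0, 0.1, 1.1, 2.1]})
toy.index = pd.MultiIndex.from_product((range(nframe_toy), range(nat_toy)),
                                       names=['frame', 'atom'])

frame1 = toy.loc[1]                                  # All atoms of frame 1, indexed by atom
atom0 = toy.xs(0, level='atom')                      # Atom 0 in every frame, indexed by frame
x_mean = toy.groupby(level='frame')['x'].mean()      # Mean x coordinate per frame
```

Each selection returns an ordinary DataFrame or Series, so the usual pandas operations apply to the result.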

# Saving your work!

We did all of this work parsing our data, but this Python kernel won't be alive forever, so let's save
our data so that we can load it later (i.e. in the next notebook!).

We are going to create an [HDF5](https://www.hdfgroup.org/HDF5/) store for saving our DataFrame(s) to. HDF is a high performance,
portable, binary data storage format designed with scientific data exchange in mind. Use it! Also note that
pandas has [extensive](http://pandas.pydata.org/pandas-docs/stable/io.html) IO functionality.

In [None]:
xyz = parse(xyz_path, nframe, nat)
store = pd.HDFStore('xyz.hdf5', mode='w', complevel=9, complib='zlib')
store.put('xyz', xyz)
store.close()
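
Loading the data back later might look like the sketch below. It writes a small made-up table to a temporary file instead of touching `xyz.hdf5`, and it assumes the optional PyTables package (`tables`) is installed, which pandas needs for HDF5 support. If it is missing, the round trip is simply skipped.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Made-up table standing in for the parsed trajectory: 2 frames of 2 atoms each.
index = pd.MultiIndex.from_product((range(2), range(2)), names=['frame', 'atom'])
toy = pd.DataFrame({'symbol': ['Na', 'O', 'Na', 'O'],
                    'x': [0.0, 1.0, 0.1, 1.1],
                    'y': [0.0, 1.0, 0.1, 1.1],
                    'z': [0.0, 1.0, 0.1, 1.1]}, index=index)

have_pytables = importlib.util.find_spec('tables') is not None
restored = None
if have_pytables:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'toy.hdf5')
        # Using the store as a context manager closes it automatically.
        with pd.HDFStore(path, mode='w', complevel=9, complib='zlib') as store:
            store.put('xyz', toy)
        restored = pd.read_hdf(path, 'xyz')    # Read the table back by its key
```

In a later notebook, the equivalent call on the real file would be `pd.read_hdf('xyz.hdf5', 'xyz')`.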

Though there are a bunch of improvements/features we could make to our parse function...

# ...let's move on to step [two](02_distances.ipynb)