In [1]:
# First let's import some libraries we will use...
import numpy as np
import scipy as sp
import pandas as pd

xyz_path = '1000.xyz'   # File path
nframe = 1000           # Number of frames (or snapshots)
nat = 195               # Number of atoms
a = 12.55               # Cell size

# First approach

Write a function that reads in an xyz trajectory file. We will need to separate numbers from atomic symbols; an XYZ trajectory file looks like:

```
nat [unit]
[first frame]
symbol1 x11 y11 z11
symbol2 x21 y21 z21
nat [unit]
[second frame]
symbol1 x12 y12 z12
symbol2 x22 y22 z22
```

Items in [ ] are optional: if the unit is absent, angstroms are assumed, and the comment line is left blank when there is no comment.
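
To make the layout concrete, here is a minimal sketch that writes a tiny two-frame trajectory as a string. The helper name `format_xyz` and the coordinates are made up for illustration; the point is only that every frame repeats the atom count line and the (possibly blank) comment line.

```python
def format_xyz(symbols, frames, comment=''):
    """Hypothetical helper: build XYZ trajectory text from symbols and per-frame coordinates."""
    lines = []
    for coords in frames:
        lines.append(str(len(symbols)))    # nat line, repeated for every frame
        lines.append(comment)              # comment line (blank if there is no comment)
        for sym, (x, y, z) in zip(symbols, coords):
            lines.append(f'{sym} {x} {y} {z}')
    return '\n'.join(lines) + '\n'

symbols = ['Na', 'O']
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],    # first frame
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],    # second frame
]
text = format_xyz(symbols, frames)
print(text)
```

Each frame therefore occupies `nat + 2` lines of the file.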

Here is an example file parser. All it does is read line by line and return a list of these lines.

In [2]:
def skeleton_naive_xyz_parser(path):
    '''
    Simple xyz parser.
    '''
    # Read in file
    lines = None
    with open(path) as f:    
        lines = f.readlines()
    # Process lines
    # Return processed lines
    return lines

lines = skeleton_naive_xyz_parser(xyz_path)
lines

['195\n',
 '\n',
 'Na 4.53636544 7.754558099999999 5.09849671\n',
 'O 9.9696927 11.411852500000002 3.88980048\n',
 'O 7.297978200000001 14.645999800000002 1.0393263000000001\n',
 'O 12.633736700000002 2.164139 8.5495922\n',
 'O 7.9961524000000015 5.6460170000000005 4.58720763\n',
 'O 1.32640402 0.35020276100000003 14.9669441\n',
 'O 2.14528015 -1.9941395600000005 13.265415599999999\n',
 'O 8.183841600000001 3.2246782000000005 3.1568286600000004\n',
 'O 11.323297900000002 9.1119192 10.615390099999999\n',
 'O 0.327833467 11.2123666 16.535482000000002\n',
 'O 6.0658843000000005 0.12356170400000002 8.1916708\n',
 'O 13.399940300000003 8.361955900000002 12.0152828\n',
 'O 8.922537199999999 7.904476700000001 10.9249609\n',
 'O -0.163887903 6.3763015 6.749299400000001\n',
 'O 9.777030900000002 5.3322142 6.9719026\n',
 'O 10.9277646 2.52648813 11.079746300000002\n',
 'O 6.506382599999999 7.6740443 8.795259800000002\n',
 'O 8.143426 11.535003699999999 12.1639847\n',
 'O 6.7587684999999995 8.022

**CODING TIME: Try to expand the skeleton above to convert the line strings 
into a list of xyz data rows (i.e. convert the strings to floats).**

If you can't figure out an approach, run the cell below, which will print one of many possible ways of 
approaching this problem.

***Note that you may have to run "%load" cells twice, once to load the code and once to instantiate the function.***

In [4]:
# %load -s naive_xyz_parser, snippets/parsing.py
def naive_xyz_parser(path):
    '''
    Simple xyz parser

    Args:
        path (str): String xyz file path

    Returns:
        data (list): List of [symbol, x, y, z] rows (atom count and comment lines are excluded)
    '''
    data = []
    with open(path) as f:
        for line in f:
            line_split = line.split()        # Split on any run of whitespace
            if len(line_split) == 4:
                try:
                    x, y, z = (float(v) for v in line_split[1:])
                    data.append([line_split[0], x, y, z])
                except ValueError:
                    pass                     # Ignore 4-token lines that are not atom rows
    return data
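

The solution above decides what an atom row is by counting tokens, so a comment line that happens to be one word followed by three numbers would be mistaken for an atom. As a hedged alternative (not the notebook's own solution), the sketch below uses each frame's atom count to decide which lines to read. The name `framewise_xyz_parser` and the demo file are assumptions made for illustration.

```python
import os
import tempfile

def framewise_xyz_parser(path):
    """Hypothetical variant: read nat, skip the comment line, then read exactly nat atom rows."""
    data = []
    with open(path) as f:
        while True:
            header = f.readline()
            if not header.strip():            # End of file (or a trailing blank line)
                break
            nat = int(header.split()[0])      # First token of the header is the atom count
            f.readline()                      # Skip the comment line, whatever it contains
            for _ in range(nat):
                symbol, x, y, z = f.readline().split()[:4]
                data.append([symbol, float(x), float(y), float(z)])
    return data

# Demo file whose first comment line looks like an atom row
demo = "2\nt 0.5 0 0\nNa 0.0 0.0 0.0\nO 1.0 0.0 0.0\n2\n\nNa 0.1 0.0 0.0\nO 1.1 0.0 0.0\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.xyz')
    with open(demo_path, 'w') as f:
        f.write(demo)
    rows = framewise_xyz_parser(demo_path)
print(len(rows))
```

The trade-off is that this version trusts the atom count lines completely and will fail loudly on a malformed frame instead of silently skipping lines.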


In [6]:
data = naive_xyz_parser(xyz_path)
data

[['Na', 4.53636544, 7.754558099999999, 5.09849671],
 ['O', 9.9696927, 11.411852500000002, 3.88980048],
 ['O', 7.297978200000001, 14.645999800000002, 1.0393263000000001],
 ['O', 12.633736700000002, 2.164139, 8.5495922],
 ['O', 7.9961524000000015, 5.6460170000000005, 4.58720763],
 ['O', 1.32640402, 0.35020276100000003, 14.9669441],
 ['O', 2.14528015, -1.9941395600000005, 13.265415599999999],
 ['O', 8.183841600000001, 3.2246782000000005, 3.1568286600000004],
 ['O', 11.323297900000002, 9.1119192, 10.615390099999999],
 ['O', 0.327833467, 11.2123666, 16.535482000000002],
 ['O', 6.0658843000000005, 0.12356170400000002, 8.1916708],
 ['O', 13.399940300000003, 8.361955900000002, 12.0152828],
 ['O', 8.922537199999999, 7.904476700000001, 10.9249609],
 ['O', -0.163887903, 6.3763015, 6.749299400000001],
 ['O', 9.777030900000002, 5.3322142, 6.9719026],
 ['O', 10.9277646, 2.52648813, 11.079746300000002],
 ['O', 6.506382599999999, 7.6740443, 8.795259800000002],
 ['O', 8.143426, 11.535003699999999, 12.1

# DataFrames

People spend a lot of time reading code, especially their own code.

Let's do two things by using DataFrames: make our code more readable
and *not* reinvent the wheel (i.e. write yet another parser). We take pride in the 
code we write! 

First an example of using DataFrames...

In [7]:
np.random.seed(1)   # Seed the generator by calling the function (values may differ from the table shown)
df = pd.DataFrame(np.random.randint(0, 10, size=(6, 4)), columns=['A', 'B', 'C', 'D'])
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 4 | 4 | 2 |
| 1 | 3 | 4 | 8 | 1 |
| 2 | 4 | 2 | 2 | 5 |
| 3 | 5 | 2 | 7 | 6 |
| 4 | 5 | 0 | 7 | 3 |
| 5 | 4 | 5 | 3 | 7 |


In [8]:
df += 1
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 5 | 5 | 3 |
| 1 | 4 | 5 | 9 | 2 |
| 2 | 5 | 3 | 3 | 6 |
| 3 | 6 | 3 | 8 | 7 |
| 4 | 6 | 1 | 8 | 4 |
| 5 | 5 | 6 | 4 | 8 |


In [9]:
df.loc[:, 'A'] = [0, 0, 1, 1, 2, 2]
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 3 |
| 1 | 0 | 5 | 9 | 2 |
| 2 | 1 | 3 | 3 | 6 |
| 3 | 1 | 3 | 8 | 7 |
| 4 | 2 | 1 | 8 | 4 |
| 5 | 2 | 6 | 4 | 8 |


In [10]:
df.groupby('A')[['B', 'C', 'D']].apply(lambda f: f.sum())

| A | B | C | D |
|---|---|---|---|
| 0 | 10 | 14 | 5 |
| 1 | 6 | 11 | 13 |
| 2 | 7 | 12 | 12 |
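
For a simple reduction like this, the `apply` call can usually be replaced by the built-in aggregation. The sketch below rebuilds the table from the previous cell by hand (the name `demo` is only for illustration) and checks that both spellings give the same numbers.

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame shown above, entered by hand
demo = pd.DataFrame({
    'A': [0, 0, 1, 1, 2, 2],
    'B': [5, 5, 3, 3, 1, 6],
    'C': [5, 9, 3, 8, 8, 4],
    'D': [3, 2, 6, 7, 4, 8],
})

applied = demo.groupby('A')[['B', 'C', 'D']].apply(lambda f: f.sum())   # generic, row-group at a time
summed = demo.groupby('A')[['B', 'C', 'D']].sum()                        # built-in aggregation

same = np.array_equal(applied.to_numpy(), summed.to_numpy())
print(summed)
print(same)
```

The built-in form is shorter to read; `apply` remains useful when the per-group operation is not one of the standard aggregations.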


# Second approach: pandas.read_csv

Like 99% (my estimate) of all widely established Python packages, pandas is very well 
[documented](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).

Let's use this pandas function to read in our well-structured xyz data. The **names** argument (see the function below) lets us specify the column names,
and the **delim_whitespace** argument tells pandas that fields are separated by whitespace (tabs or spaces). Since pandas 2.2, **delim_whitespace** is deprecated in favor of the equivalent `sep=r'\s+'`.

**CODING TIME: Figure out which options we need to correctly parse the XYZ trajectory data using pandas.read_csv**

In [11]:
def skeleton_pandas_xyz_parser(path):
    '''
    Parses xyz files using pandas read_csv function.
    '''
    # Read from disk
    df = pd.read_csv(path, sep=r'\s+', names=['symbol', 'x', 'y', 'z'])    # sep=r'\s+' is the current spelling of delim_whitespace=True
    # Remove nats and comments
    # 
    #
    return df

In [12]:
df = skeleton_pandas_xyz_parser(xyz_path)
df.head()

|   | symbol | x | y | z |
|---|---|---|---|---|
| 0 | 195 | NaN | NaN | NaN |
| 1 | Na | 4.536365 | 7.754558 | 5.098497 |
| 2 | O | 9.969693 | 11.411853 | 3.8898 |
| 3 | O | 7.297978 | 14.646 | 1.039326 |
| 4 | O | 12.633737 | 2.164139 | 8.549592 |


One possible solution (**run this only if you have already finished the above!**):

In [14]:
# %load -s pandas_xyz_parser, snippets/parsing.py
def pandas_xyz_parser(path):
    '''
    Parse xyz files using pandas read_csv.

    Args:
        path (str): XYZ file path

    Returns:
        df (:class:`pandas.DataFrame`): Table of XYZ data
    '''
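    # Read whitespace-separated data; sep=r'\s+' replaces the deprecated delim_whitespace=True
    df = pd.read_csv(path, sep=r'\s+', names=['symbol', 'x', 'y', 'z'])
    df.dropna(inplace=True)                                     # Drop nat lines (their x, y, z are NaN)
    df[['x', 'y', 'z']] = df[['x', 'y', 'z']].astype(float)     # Convert types (np.float no longer exists)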
    return df


In [15]:
df = pandas_xyz_parser(xyz_path)
df.head()

|   | symbol | x | y | z |
|---|---|---|---|---|
| 1 | Na | 4.536365 | 7.754558 | 5.098497 |
| 2 | O | 9.969693 | 11.411853 | 3.8898 |
| 3 | O | 7.297978 | 14.646 | 1.039326 |
| 4 | O | 12.633737 | 2.164139 | 8.549592 |
| 5 | O | 7.996152 | 5.646017 | 4.587208 |
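
To see what each step of this approach does without touching the real file, here is a small sketch on an in-memory trajectory. The two-frame text is made up for illustration. Blank comment lines disappear because `read_csv` skips blank lines by default, and each atom count line becomes a row whose x, y, z are NaN, which `dropna` then removes.

```python
import io
import pandas as pd

# Made-up two-frame trajectory with blank comment lines
text = "2\n\nNa 0.0 0.0 0.0\nO 1.0 0.0 0.0\n2\n\nNa 0.1 0.0 0.0\nO 1.1 0.0 0.0\n"

raw = pd.read_csv(io.StringIO(text), sep=r'\s+', names=['symbol', 'x', 'y', 'z'])
n_missing = int(raw['x'].isna().sum())    # one NaN row per atom count line
clean = raw.dropna()                      # keep only the atom rows

print(len(raw), n_missing, len(clean))
print(clean.dtypes)
```

Note that the surviving rows keep their original labels (the dropped rows leave gaps), which is one reason to attach a fresh index later.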


# Testing your functions is key

A couple of quick tests should suffice...though these barely make the cut...

In [18]:
print(len(df) == nframe * nat)    # Make sure that we have the correct number of rows
print(df.dtypes)                  # Make sure that each column's type is correct

True
symbol     object
x         float64
y         float64
z         float64
dtype: object
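
Printed checks are easy to overlook. A slightly sturdier sketch wraps the same checks in assertions so that a bad table stops execution. The helper name `check_xyz_table` and the small demo table are assumptions made for illustration, not part of the notebook's snippets.

```python
import pandas as pd

def check_xyz_table(df, nframe, nat):
    """Hypothetical helper: raise AssertionError if the parsed table looks wrong."""
    assert len(df) == nframe * nat, f'expected {nframe * nat} rows, got {len(df)}'
    assert list(df.columns) == ['symbol', 'x', 'y', 'z'], 'unexpected columns'
    for col in ['x', 'y', 'z']:
        assert pd.api.types.is_float_dtype(df[col]), f'{col} is not a float column'
    assert not df.isna().any().any(), 'table contains missing values'

# Made-up table: 2 frames of 2 atoms
demo = pd.DataFrame({
    'symbol': ['Na', 'O', 'Na', 'O'],
    'x': [0.0, 1.0, 0.1, 1.1],
    'y': [0.0, 0.0, 0.0, 0.0],
    'z': [0.0, 0.0, 0.0, 0.0],
})
check_xyz_table(demo, nframe=2, nat=2)    # passes silently
```

For the real data the call would be `check_xyz_table(df, nframe, nat)`.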


# Let's attach a meaningful index
This is easy since we know the number of atoms and number of frames...

In [19]:
df = pandas_xyz_parser(xyz_path)
df.index = pd.MultiIndex.from_product((range(nframe), range(nat)), names=['frame', 'atom'])
df

| frame | atom | symbol | x | y | z |
|---|---|---|---|---|---|
| 0 | 0 | Na | 4.536365 | 7.754558 | 5.098497 |
| 0 | 1 | O | 9.969693 | 11.411853 | 3.889800 |
| 0 | 2 | O | 7.297978 | 14.646000 | 1.039326 |
| 0 | 3 | O | 12.633737 | 2.164139 | 8.549592 |
| 0 | 4 | O | 7.996152 | 5.646017 | 4.587208 |
| 0 | 5 | O | 1.326404 | 0.350203 | 14.966944 |
| 0 | 6 | O | 2.145280 | -1.994140 | 13.265416 |
| 0 | 7 | O | 8.183842 | 3.224678 | 3.156829 |
| 0 | 8 | O | 11.323298 | 9.111919 | 10.615390 |
| 0 | 9 | O | 0.327833 | 11.212367 | 16.535482 |
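
What the (frame, atom) index buys us is easy selection. The sketch below uses a made-up table of 2 frames and 3 atoms (names and numbers are for illustration only) to pick out one frame, one atom across all frames, and a per-frame average.

```python
import pandas as pd

nframe_demo, nat_demo = 2, 3
demo = pd.DataFrame({
    'symbol': ['Na', 'O', 'O'] * nframe_demo,
    'x': [0.0, 1.0, 2.0, 0.1, 1.1, 2.1],
})
demo.index = pd.MultiIndex.from_product(
    (range(nframe_demo), range(nat_demo)), names=['frame', 'atom'])

frame1 = demo.loc[1]                                # all atoms of frame 1, indexed by atom
atom0 = demo.xs(0, level='atom')                    # atom 0 in every frame, indexed by frame
mean_x = demo.groupby(level='frame')['x'].mean()    # average x per frame

print(frame1)
print(atom0)
print(mean_x)
```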


**CODING TIME: Put parsing and indexing together into a single function.**

In [21]:
# %load -s parse, snippets/parsing.py
def parse(path, nframe, nat):
    '''
    Complete parsing of xyz files.
    '''
    df = pandas_xyz_parser(path)
    df.index = pd.MultiIndex.from_product((range(nframe), range(nat)), names=['frame', 'atom'])
    return df


# Saving your work!

We did all of this work parsing our data, but this Python kernel won't stay alive forever, so let's save
our data so that we can load it later (i.e. in the next notebook!).

We are going to create an [HDF5](https://www.hdfgroup.org/HDF5/) store for saving our DataFrame(s) to. HDF is a high performance,
portable, binary data storage format designed with scientific data exchange in mind. Use it! Also note that
pandas has [extensive](http://pandas.pydata.org/pandas-docs/stable/io.html) IO functionality.

In [22]:
xyz = parse(xyz_path, nframe, nat)
store = pd.HDFStore('xyz.hdf5', mode='w')
store.put('xyz', xyz)
store.close()
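
Loading the data back is just as short. The sketch below does a round trip on a small made-up table in a temporary directory (the file name `demo.hdf5` is an assumption for illustration); like the cell above, it assumes the PyTables package (`tables`) is installed, since pandas relies on it for HDF5.

```python
import os
import tempfile
import pandas as pd

# Made-up table with the same kind of (frame, atom) index
demo = pd.DataFrame({
    'symbol': ['Na', 'O', 'Na', 'O'],
    'x': [0.0, 1.0, 0.1, 1.1],
})
demo.index = pd.MultiIndex.from_product((range(2), range(2)), names=['frame', 'atom'])

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.hdf5')
    with pd.HDFStore(demo_path, mode='w') as store:    # the context manager closes the store for us
        store.put('xyz', demo)
    loaded = pd.read_hdf(demo_path, 'xyz')             # read the table back by its key

print(loaded)
```

For the store written above, the equivalent call would be `pd.read_hdf('xyz.hdf5', 'xyz')`.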

Though there are a bunch of improvements/features we could make to our parse function...
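
One such improvement, sketched under assumptions: read the atom count from the first line of the file and infer the number of frames from the row count, so the caller no longer has to pass them in. The name `parse_infer` and the demo file are hypothetical, and the sketch assumes every frame has the same number of atoms.

```python
import os
import tempfile
import pandas as pd

def parse_infer(path):
    """Hypothetical variant of parse that infers nat and nframe from the file itself."""
    with open(path) as f:
        nat = int(f.readline().split()[0])    # atom count from the first line
    df = pd.read_csv(path, sep=r'\s+', names=['symbol', 'x', 'y', 'z']).dropna()
    nframe, leftover = divmod(len(df), nat)
    if leftover:
        raise ValueError(f'{len(df)} atom rows is not a multiple of nat={nat}')
    df.index = pd.MultiIndex.from_product(
        (range(nframe), range(nat)), names=['frame', 'atom'])
    return df

# Made-up two-frame file for demonstration
demo = "2\n\nNa 0.0 0.0 0.0\nO 1.0 0.0 0.0\n2\n\nNa 0.1 0.0 0.0\nO 1.1 0.0 0.0\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.xyz')
    with open(demo_path, 'w') as f:
        f.write(demo)
    inferred = parse_infer(demo_path)
print(inferred)
```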

# ...let's move on to step [two](02_distances.ipynb)