# Processing Files

<div class="alert alert-block alert-info">
<h2>Overview</h2>

Questions:

* What are some standard data file formats used in chemistry?

* How do you read information from data files into Python programs?

Objectives:

* Read information from standard data file formats into Python code.

* Use `for` loops and `if` statements to search for information in files.

* Split strings and convert (“cast”) them to numeric types.
    
* Write information to files.

**Data needed**: This lesson assumes that a `data/` folder (containing an `outfiles/` folder with several example output files) is present in the same folder as this notebook.  If you cloned this notebook from the class repo on GitHub, this should already be the case.

</div>


## Specifying the file path and reading a file

To analyze data from a file, we first need to read that data into our code, which means telling Python where the file is located.  This location is called the file path.  You can think of a file path as the series of folders you would click through to reach the file if you were navigating your computer's file structure.  In this class, all of our data files are in a folder called `data`, which is in the GitHub repo you cloned for class.

In general, the steps to reading in a file to your code are:
1. Open the file with `open(file_path, 'r')`.  The file path is a string, so it should be defined in quotes.
2. Read all lines into a list of strings with `readlines()`.  The `readlines()` function outputs a list, where every element of the list is **one line** of the file.  So, when you call `readlines()`, you will need to save the output to a variable.
3. Close the file with `close()`.

In [None]:
ethanol_file = "data/outfiles/ethanol.out"
outfile = open(ethanol_file, "r")
data = outfile.readlines()
outfile.close()

print(f'The third line of the file is: {data[2]}')
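
As an aside, the open/read/close steps above can also be written with Python's `with` statement, which closes the file for you when the indented block ends.  The following is a minimal sketch, not part of the class data: it first creates a small stand-in file (the name `demo.out` and its contents are made up for illustration) so that it can run anywhere.

```python
import os
import tempfile

# Create a tiny stand-in file in a temporary folder (hypothetical contents)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.out")
with open(demo_path, "w") as handle:
    handle.write("first line\nsecond line\nthird line\n")

# Read it back; the file is closed automatically after the indented block
with open(demo_path, "r") as handle:
    demo_lines = handle.readlines()

print(demo_lines)       # a list with one string per line
print(handle.closed)    # True once the with block has ended
```

Both styles read the same data; the explicit `open` and `close` calls used in this lesson make each step visible, while `with` guards against forgetting to close the file.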

<div class="alert alert-block alert-warning"> 
<strong>Check your understanding</strong>
    
Using skills that you learned last week, determine how many lines were in the file you just read in.  The key is remembering the way `readlines()` works; it outputs a list and each line from the file is one element of the list.

</div>

In [None]:
# Your code goes here

## Searching for a pattern in your file

The ethanol output contains a line with the phrase **`Final Energy`**, which reports the calculated energy of the molecule.  We want to search through the file and find this one piece of information.

In [None]:
# This is creating a blank variable to hold our line when we find it
# Somewhat like creating a blank list from last time
energy_line = None

for line in data:
    if 'Final Energy' in line:
        energy_line = line

print(f"Matched line: {energy_line}")

## Splitting strings and extracting a value

Looking at the matched line, the critical piece of information we want to extract is the number, that is, the energy of the molecule.  Since each line is a string, we can parse the information further by splitting the string to get the exact piece of information we want.  To split a string, you use the `.split()` function to break it into pieces.  The output of the function is a list of the different parts of the string.  The syntax is `list_after_splitting = original_string_name.split()`

By default, `.split()` splits on whitespace.  If you give it a different argument (in the parentheses), it splits on that character instead, such as a `.` or `,`.
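
As a small illustration of the separator argument, here is a sketch using made-up strings (they are not taken from the class data files):

```python
# Hypothetical strings, chosen only to show how the separator changes the result
csv_line = "ethanol,C2H6O,46.07"
comma_parts = csv_line.split(",")      # split on commas
print(comma_parts)

file_name = "ethanol.out"
dot_parts = file_name.split(".")       # split on periods
print(dot_parts)

spaced = "  Final   Energy:   -1.5  "
space_parts = spaced.split()           # no argument: split on any run of whitespace
print(space_parts)
```

Note that every piece in the resulting lists is still a string, including pieces that look like numbers.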

In [None]:
split_list = energy_line.split()
print(split_list)

In [None]:
# Now figure out which element of the list is the number you really want
energy = split_list[3]   # the numeric energy value as a string
print(f"Energy: {energy}")

## Casting to numeric types

Remember that the `readlines()` function read in everything from the file as a list of strings.  So even though the energy looks like a number, it is **still a string**.  If you attempted to do a math operation on the `energy` variable, it would give you an error.  
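
If you are ever unsure what type a variable holds, you can check it.  The sketch below uses a made-up value rather than the real `energy` variable:

```python
# Hypothetical text, standing in for a value read from a file
value = "-1.2345"

print(type(value))               # shows the type of the variable
print(isinstance(value, str))    # True, because value is a string

# Multiplying a string by an integer does not do arithmetic; it repeats the text
repeated = value * 2
print(repeated)
```

This is why text that looks numeric needs to be converted before you do any math with it.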

In [None]:
# This gives an error because energy is still a string
# Note: 627.51 is the hartree to kcal/mol conversion factor
energy_kcal_per_mol = energy*627.51
print(energy_kcal_per_mol)

<div class="alert alert-block alert-warning"> 
<strong>Check your understanding</strong>
    
Using skills that you learned last week, recast `energy` to the appropriate data type.  Should you overwrite the variable with the new type or use a new variable name?

</div>

In [None]:
# Your code goes here

### Python negative indexing

If the value you need is the **last** item in a list, you can use index `-1`.

In [None]:
energy_alt = float(split_list[-1])
print(f"Energy via split_list[-1]: {energy_alt}")


### A note on regular expressions

For complex matching (e.g., “match only at the start of a line” or “match a pattern like capital letter + digits”), Python’s `re` module (regular expressions) can help. Regex is outside the scope of this lesson, but it’s a valuable tool for advanced parsing.
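
For the curious, here is a brief, optional sketch of the two cases mentioned above.  The example strings are made up, and you do not need regular expressions for anything in this lesson.

```python
import re

# Hypothetical lines: only the first one starts with the label
lines = [
    "Final Energy: -1.5",
    "    see Final Energy above",
]

# re.match only succeeds if the pattern matches at the very start of the string
starts_with_label = []
for ln in lines:
    if re.match(r"Final Energy", ln):
        starts_with_label.append(ln)
print(starts_with_label)

# [A-Z] means one capital letter; \d+ means one or more digits
tokens = re.findall(r"[A-Z]\d+", "C2 H6 O1 and lowercase c3")
print(tokens)
```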


## Processing Many Files at Once

Parsing one file in this way is useful, but you might be wondering why you couldn't just look through the file yourself, find the information you want, and copy-paste it.  The copy-and-paste method might be fine for one or two files, but what if you had **many** files with the same structure? 

### Using functions from python libraries
As we emphasized last week, many problems in code can be solved in standard ways; that is, you probably aren't the first person who needs to complete a particular computational task.  Someone else may have already written a function to do exactly the task you need to do.  Some of these functions are built into Python, but there are also many **Python libraries**, which are collections of functions, generally organized around a common topic or task.  To use a function from a library, there are two steps.
1. Import the library in your code, somewhere prior to where you want to use it:  `import library_name`
2. To call a function from a library, use the syntax `library_name.function_name()`.  Just like any other function, there may be input arguments which are required or optional. 
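
As a quick illustration of these two steps, here is a sketch using the standard `math` library (chosen only as a familiar example; it is not otherwise needed in this lesson):

```python
# Step 1: import the library
import math

# Step 2: call a function with library_name.function_name()
root = math.sqrt(16)
print(root)

# Libraries can also provide values, accessed the same way
print(math.pi)
```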


Back to our problem of parsing many files with the same file structure, we first need to assemble a list of the files (or maybe even file paths) we want to parse.  As you might expect, this kind of thing comes up in other codes, and there is a function that can search in a folder, find all the files of a specific type, and make a list for you.  This function is in a library called `glob`.  Before we can use the function, we need to import the library.  You can do this at any point before the function is used, but a common practice is to import all your required libraries in your first code block, which you will see in future notebooks.

In [None]:
# Don't forget to run the import cell!
import glob

Now we will define a pattern and search for all files that fit that pattern within our `data/outfiles` folder.  The function that does this in the `glob` library happens to also be called `glob`.  In defining our pattern to match, we can use the wildcard `*`, which matches any sequence of characters.

In [None]:
# This is the pattern to match
pattern = 'data/outfiles/*.out'

# Call the glob.glob function
# The input is the pattern we want to match
# The output is the list of filenames that match our pattern
filenames = glob.glob(pattern)
print(f"Number of files found: {len(filenames)}")
print(f"Example files: {filenames[:5]}")
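
If you would like to see how the wildcard behaves without touching the class data, the sketch below builds a temporary folder with three made-up files and searches it.  The file names are hypothetical.  The list is sorted here because `glob.glob` does not promise any particular order.

```python
import glob
import os
import tempfile

# Create a temporary folder containing two .out files and one .txt file
demo_dir = tempfile.mkdtemp()
for name in ["water.out", "methane.out", "notes.txt"]:
    demo_file = open(os.path.join(demo_dir, name), "w")
    demo_file.write("placeholder\n")
    demo_file.close()

# The * wildcard matches any file name that ends in .out
matches = sorted(glob.glob(os.path.join(demo_dir, "*.out")))

# Keep just the file names so the output is easy to read
names = []
for match in matches:
    names.append(os.path.basename(match))
print(names)
```

The `.txt` file is skipped because it does not fit the pattern.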

## Looping over files and parsing energies

Now we want to do the file parsing we did before, but instead of doing it for just one file, we want to do it for every file in our list of filenames.  This sounds like a job for a `for` loop.  But then once we have one file open, we will need to use a `for` loop to search over the information in that file.  Using a `for` loop inside another `for` loop is called a **nested** for loop.  When using a nested for loop, you have to be very careful with indentation.   

In [None]:
# Create a blank list to store the energies once we finally get them
energies = []

# Outer for loop; loops over the list of filenames
for file in filenames:
    outfile = open(file, "r")
    data = outfile.readlines()
    outfile.close()

    for line in data:
        if 'Final Energy' in line:
            split_list = line.split()
            energy = float(split_list[3])
            energies.append(energy)
            print(energy)

# Print the whole list only once, after both loops have finished
print(energies)

### Extract the molecule name from the filename

How could we make our output more useful?  Instead of just printing the entire list of energies at the end, we could label each energy with its corresponding molecule.  

Often, when you are implementing a new feature in your code, it is easier to try it out on one example and then implement it in your `for` loop.  This is actually the structure of this lesson, where we learned to parse one file and then moved on to many files.

<div class="alert alert-block alert-warning"> 
<strong>Check your understanding</strong>
    
Consider the code block below.  We create a variable called `first_file`, which is just the first file path in our list of files.  Use what you learned about splitting strings earlier in this lesson to write code that splits up the full file name so you can get just the molecule name.  Remember, while the default option is for `.split()` to split on whitespace, you can specify an argument to split on a different character.

</div>

In [None]:
# Your code goes here
first_file = filenames[0]
print(first_file)
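
Once you have tried the exercise yourself, you can compare against the sketch below, which shows two possible approaches on the ethanol path used earlier in this lesson.  The second approach uses the standard `os.path` functions and is offered as an alternative; it can be more robust because, on Windows, `glob` may return paths that contain backslashes rather than forward slashes.

```python
import os

example_path = "data/outfiles/ethanol.out"

# Approach 1: split on "/" to get the file name, then on "." to drop the extension
file_name = example_path.split("/")[-1]
molecule = file_name.split(".")[0]
print(molecule)

# Approach 2: let os.path handle the folder separator and the extension
base = os.path.basename(example_path)        # "ethanol.out"
molecule_alt = os.path.splitext(base)[0]     # "ethanol"
print(molecule_alt)
```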


Now that we have figured out how to pull out the molecule name, we just implement that process in our `for` loop.

In [None]:
# Copy and paste the cell from above and implement the new portion


## Writing results to a file

Suppose we now wanted to share our results with someone else, like our research advisor.  You might be tempted to just email them your whole notebook and tell them to run the code themselves, but that is generally not preferable.  Instead of just printing our results to the screen, we could write the data to a file.  There are only a few steps needed to change our code to write to a file.  
- Open a file that you want to write to (here, `energy_results.txt`), using `'w'` instead of `'r'` in the `open` call.
- Change the print line to a `filehandle.write` command.
- Don't forget to close the file you are writing to at the end.

You may also want to change what you were printing slightly to make it look better in the file.  Unlike `print`, `write` does not move to a new line automatically; you have to request this with the newline character `\n`.
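
The sketch below shows these steps on made-up data.  The labels, values, and the file name `demo_results.txt` are hypothetical, and the file is written to a temporary folder so that nothing in your class folder is overwritten.

```python
import os
import tempfile

# Hypothetical labels and values, standing in for molecule names and energies
labels = ["molecule_a", "molecule_b"]
values = [-1.5, -2.25]

# Open the file for writing ("w")
out_path = os.path.join(tempfile.mkdtemp(), "demo_results.txt")
datafile = open(out_path, "w")

# zip pairs up the items of the two lists, one pair per pass through the loop
for label, value in zip(labels, values):
    # \t inserts a tab between columns; \n ends the line
    datafile.write(f"{label}\t{value}\n")

# Close the file so everything is saved
datafile.close()

# Read the file back to check what was written
check = open(out_path, "r")
written = check.readlines()
check.close()
print(written)
```

Without the `\n`, both entries would end up on a single line of the file.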

In [None]:
# Copy and paste cell from above and make changes

Now, in your left-hand file menu, you should see your `energy_results.txt` file.  Open it and make sure the formatting looks good.  If you want to change it, just modify what is in your write statement and run the cell again.

## Final Activity: parse total energies from `03_Prod.mdout`

The file `data/03_Prod.mdout` is the output from a molecular dynamics simulation, a type of computational chemistry calculation that models how molecules move.  At each step, the program calculates the energy of the molecule in its current configuration.  Parse the **total energy** values labeled `Etot` and write them (one per line) to a new file named `Etot.txt`.

**Target output format** (first several lines):

```
-4585.1049
-4573.5326
-4548.1223
...
```

Steps:
1. Open the `data/03_Prod.mdout` file and read in all the data. Don't forget to close the file!
2. Search through each line and find lines that contain `Etot`.
3. For those lines, parse out the energy and write it to a file.

For your lab report today, turn in your `Etot.txt` file on Canvas.

In [None]:
# Your code goes here
