<div style="text-align:center;">
  <img src="https://github.com/act-cms/foundational-modules/blob/main/intro-programming/colab-notebooks/images/act-cms-header.png?raw=true">
</div>
<div>
<center>
<a href="https://act-cms.molssi.org/">act-cms.molssi.org</a>

# Reading Information from Files (File Parsing)






In [None]:
# @title Overview
%%html
<style>
div.info {
    color: #0056b3;
    background-color: #d9edf7;
    border-left: 5px solid #31708f;
    padding: 0.5em;
    font-size: 1.25em; /* A little larger than the surrounding text size */
    line-height: 1.5; /* Ensures readability */
}
div.info ul {
    margin: 0.5em 0; /* Space around the list */
}
div.info li {
    margin-bottom: 0.5em; /* Space between list items */
}
</style>

<div class="info">
    <strong>Questions:</strong>
    <ul>
        <li>How can I use Python to read text files?</li>
        <li>How do I sort through all the information in a text file and extract particular pieces of information?</li>
    </ul>

    <strong>Objectives:</strong>
    <ul>
        <li>Open a file and read in its contents line by line.</li>
        <li>Search for a particular string in a file.</li>
        <li>Manipulate strings and change data types.</li>
    </ul>
</div>

## Working with files

One of the most common tasks in research is analyzing data.
Many computational chemistry programs write text output files that mix explanatory text with the data you need to analyze. Often, you need to sort through the output file and pick out the pieces of information that matter most to you.
In general, this is called "file parsing".

In this notebook, we will learn how to open a text file and use Python to go through the information in the file.
We will be opening a file from the quantum chemistry package Psi4 and looking for the Final Energy that was calculated for the molecule.

The line we will get information from looks like this:

```
 @DF-RHF Final Energy:  -154.09130176573018
```

We know that because we have previewed the file in a text editor. Whenever you are looking for information in a text file, you first have to figure out what that information looks like so that you can write a program for it!

## Putting Files on Google Drive
[Click this link to download files for this lesson](https://github.com/act-cms/foundational-modules/raw/refs/heads/main/intro-programming/intro_programming_data.zip).

After you have downloaded the zip file, unzip it on your local computer. On Windows, right-click the zip file and choose "Extract All".

Next, navigate to your Google Drive and upload the folder you unzipped (it should be named `intro_programming_data`).


## Accessing Files From Google Drive in Your Notebook

We will need to read files that we have stored on Google Drive.
We will use a few special lines of code to do this.

The cell below will mount your Google Drive so that you can access the files. After you run this code, a pop-up window will ask whether you want to allow Google Colab to access Google Drive.
Choose "yes" to these questions.

<img src="https://github.com/act-cms/foundational-modules/blob/main/intro-programming/colab-notebooks/images/google_colab_file_access.png?raw=true">


In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


When you are done, you will be able to see your Google Drive files by clicking the folder icon to the left.

<img src="https://github.com/act-cms/foundational-modules/blob/main/intro-programming/colab-notebooks/images/view_files_colab.png?raw=true">


## Previewing the File

Click the arrow beside "drive" to expand your view. Next, find the "intro_programming_data" folder and expand it by clicking the arrow button.
Repeat this for the "outfiles" folder, then double click on the file "ethanol.out".

This will open a tab to the right of the notebook where you can preview the file. If you scroll to line `228`, you will see the piece of information we would like to get from this file.


## Getting the File Path

For this section, we will be working with the file `ethanol.out` in the `intro_programming_data/outfiles` directory.

You can find this file by expanding the "drive" folder shown in the image above, then expanding "MyDrive". Then, you should see the `intro_programming_data` folder that you added in the Workshop Setup instructions.

Find the `ethanol.out` file in the `outfiles` folder. Then, right click the file name and click "Copy Path".
This will give you the "File Path", or the directions to find the file. We will use this to tell Python where the file is.

The file we will be working with is in the `outfiles` folder and is `ethanol.out`. We will pull a piece of information (the energy) from this file.

We will open the file in the next step, but first, we have to tell Python where the file is.
We will create a variable called `ethanol_file` that contains a string that tells Python where the file is.
This string is the "file path" that we copied previously: a list of folder names ending in the file name, all separated by forward slashes (`/`).

When deciding your file path, you can think about what you would tell someone to click in order to find the file.

In [None]:
ethanol_file = "/content/drive/MyDrive/intro_programming_data/outfiles/ethanol.out"
print(ethanol_file)

/content/drive/MyDrive/intro_programming_data/outfiles/ethanol.out
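
If Python later reports that it cannot find the file, a mistyped path is a common cause. As a minimal sketch, `os.path.exists` from the standard library reports whether a path points to something that is really there. The temporary file and the names `sample_path` and `missing_path` are made up for illustration so that the example runs anywhere; in this notebook you would pass `ethanol_file` instead.

```python
import os
import tempfile

# Create a throwaway file as a stand-in for ethanol.out (illustration only).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.out")
    with open(sample_path, "w") as f:
        f.write("placeholder\n")

    # A path that we never created.
    missing_path = os.path.join(tmp_dir, "does_not_exist.out")

    # Check both paths while the temporary folder still exists.
    found = os.path.exists(sample_path)
    not_found = os.path.exists(missing_path)

print(found)
print(not_found)
```

If the check on your own path comes back `False`, compare the string character by character with the path you copied from the file browser.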


## Reading a file

In Python, there are many ways to read in information from a text file.
The best method to use depends on the type of data and the type of analysis you are performing.
We will use the `open` function to open the file, and another function called `readlines` to pull information out of the file.
If your file mixes text and numbers in several different formats, the most generic way to read in the information is the `read` or `readlines` function.
Before you can read in a file, you have to open the file using the file path we defined above.
This will create a file object, or filehandle. The file we will be analyzing in this example is a Psi4 output file for an SCF/cc-pVDZ energy calculation for an ethanol molecule.

In Python, when we use the `open` function, the syntax is

```python
with open(filename, open_mode) as variable:
    # read the file
    data = variable.readlines()
```

In the `open` function, we specify the file we want to open as the first argument to the function (`filename` above), followed by the opening mode (`open_mode` above). The mode `"r"` specifies that we want to read the file.

Next, we use the `readlines` function. This pulls all of the information from the file into a `list`.
Each element in the list is a line in the file.
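
The difference between `read` and `readlines` is easiest to see on a small file. The sketch below writes a three-line temporary file (a made-up stand-in, not one of the lesson files) and reads it back both ways; the names `sample_path`, `as_string`, and `as_list` are chosen only for this illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny three-line file to experiment with.
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w") as f:
        f.write("first line\nsecond line\nthird line\n")

    # read() returns the whole file as one string.
    with open(sample_path, "r") as f:
        as_string = f.read()

    # readlines() returns a list with one string per line.
    with open(sample_path, "r") as f:
        as_list = f.readlines()

print(type(as_string))
print(type(as_list))
print(as_list[0])
```

Looping over the string from `read` would visit one character at a time, while looping over the list from `readlines` visits one line at a time, which is what we want for searching a file line by line.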

In [None]:
## Write your code here to read the file!
with open(ethanol_file, "r") as f:
    data = f.readlines()

In [None]:
# @title Exercise
%%html
<style>
div.orange-alert {
    color: #854f00; /* Darker shade of orange for text */
    background-color: #ffe6cc; /* Light orange background */
    border-left: 5px solid #ff9933; /* Bright orange border */
    padding: 0.5em;
    font-size: 1.25em; /* Matches the surrounding text size */
    line-height: 1.5; /* Ensures readability */
}
div.orange-alert ul {
    margin: 0.5em 0; /* Space around the list */
}
div.orange-alert li {
    margin-bottom: 0.5em; /* Space between list items */
}
</style>

<div class="orange-alert">

<strong>Check Your Understanding</strong>
<p>
Check that your file was read in correctly by determining how many lines are in the file.
</p>
<p>You should use a function you learned about in the last lesson for this.
  </p>
</div>

## Searching for a pattern in your file

The file we opened is an output file from a calculation of the energy (and a lot of other stuff!) for an ethanol molecule. As stated previously, the `readlines()` function puts the file contents into a list where each element is a line of the file. You may remember from lesson 1 that a `for` loop can be used to execute the same code repeatedly. As we learned in the previous lesson, we can use a `for` loop to iterate through the elements in a list.

Let’s take a look at what’s in the file.

```python
for line in data:
    print(line)
```

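This will print every line in the file. Because each line already ends with a newline character and `print` adds another, the output appears double-spaced.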

If you look through the output, you will see that the critical line says “Final Energy”. We want to search through this file and find that line, and print only that line. We can do this using an if statement.

Returning to our file example,


In [None]:
for line in data:
    if 'Final Energy' in line:
        energy_line = line
        print(energy_line)

  @DF-RHF Final Energy:  -154.09130176573018
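
The same search can also record where the match was found. The sketch below is an optional extension, not part of the lesson's own code: it uses the built-in `enumerate` function to count lines starting from 1, and it runs on a short made-up list (`sample_lines`) standing in for the list that `readlines` returns.

```python
# Hypothetical stand-in for the list returned by readlines().
sample_lines = [
    "  Some header text\n",
    "  @DF-RHF Final Energy:  -154.09130176573018\n",
    "  Some footer text\n",
]

matches = []
# enumerate pairs each line with a counter; start=1 matches text-editor numbering.
for line_number, line in enumerate(sample_lines, start=1):
    if "Final Energy" in line:
        # strip() removes the leading spaces and the trailing newline.
        matches.append((line_number, line.strip()))

print(matches)
```

Collecting matches in a list, rather than overwriting a single variable, also keeps every hit if the pattern appears on more than one line.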



Remember that `readlines()` saves each line of the file as a string, so `energy_line` is a string that contains the whole line.
For our analysis, if we are most interested in the energy, we need to split up the line so we can save just the number as a different variable name.
To do this, we use a new function called `split`.
The `split` function takes a string and divides it into its components using a delimiter.

The delimiter is specified as an argument to the function (placed inside the parentheses `()`). If you do not specify a delimiter, the string is split on whitespace (spaces, tabs, and newlines) by default. Let’s try this out.

In [None]:
energy_line.split()

['@DF-RHF', 'Final', 'Energy:', '-154.09130176573018']

Or, we can use the colon (`:`) as the delimiter.

In [None]:
## Split energy line on the colon
energy_line.split(":")

['  @DF-RHF Final Energy', '  -154.09130176573018\n']

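When we use `:` as the delimiter, the line is split where the colon was found, and a list with two elements is returned. Unlike the whitespace split, the pieces keep their surrounding spaces and the trailing newline (`\n`).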
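
As a small side-by-side sketch, the two ways of splitting can be compared on a copy of the energy line. The variable names here (`energy_line_example`, `by_whitespace`, `by_colon`, `value_text`) are chosen for this illustration only, and `strip` is a string function not otherwise used in this lesson; it removes whitespace from both ends of a string.

```python
# A copy of the line found in the file, including its leading spaces and newline.
energy_line_example = "  @DF-RHF Final Energy:  -154.09130176573018\n"

# No argument: split on any run of whitespace; no empty strings are produced.
by_whitespace = energy_line_example.split()

# Explicit delimiter: split only at the colon; spaces and the newline are kept.
by_colon = energy_line_example.split(":")

# strip() tidies up the piece after the colon.
value_text = by_colon[1].strip()

print(by_whitespace)
print(by_colon)
print(value_text)
```

Splitting on whitespace is usually the simpler choice when the value you want is a separate "word" on the line.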

We can save the output of this function to a variable as a new list. In the example below, we take the line we found in the `for` loop and split it up into its individual words.

In [None]:
## Split energy_line on whitespace and save as variable called "words"
words = energy_line.split()
words

['@DF-RHF', 'Final', 'Energy:', '-154.09130176573018']


From this output, we see that we now have a list called `words`, created by splitting `energy_line`. The energy is the fourth element of this list (index 3, because Python starts counting at zero), so we can now save it as a new variable.

In [None]:
energy = words[3]
print(energy)

-154.09130176573018


In [None]:
# @title Python Negative Indexing
%%html
<style>
div.purple-box {
    color: #4b0082; /* Indigo for text */
    background-color: #f3e5f5; /* Light lavender background */
    border-left: 5px solid #7b1fa2; /* Medium purple border */
    padding: 0.5em;
    font-size: 1.25em; /* Matches the surrounding text size */
    line-height: 1.5; /* Ensures readability */
    font-family: Arial, sans-serif; /* Clean, modern font */
}
div.purple-box ul {
    margin: 0.5em 0; /* Space around the list */
}
div.purple-box li {
    margin-bottom: 0.5em; /* Space between list items */
}
</style>

<div class="purple-box">
    <strong>Python Negative Indexing:</strong>
    <p>
        We also recognize that the energy is the last element of the list. In Python, we can count backwards from the end of the list by using negative numbers. Therefore, an alternative way to assign energy is:
    </p>
    <pre>
energy = words[-1]
    </pre>
</div>


If we now try to do a math operation on energy, we get an error message.
Why do you think that is?

In [None]:
energy + 50

Even though `energy` looks like a number to us, it is really a string, so we cannot add an integer to it; Python raises a `TypeError`. We need to change the data type of `energy` to a float. This is called casting.

In [None]:
energy = float(energy)

Now our math operation will work. If we thought ahead, we could have changed the data type when we assigned the variable originally.

In [None]:
energy = float(words[3])
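
Casting only works when the string really looks like a number. The sketch below, which uses a hard-coded copy of the split line (`words_example`, a name made up for this illustration), shows a successful cast followed by arithmetic, and a failed cast caught with `try`/`except`. Error handling is not covered in this lesson, so treat that part as an optional preview.

```python
# Hard-coded copy of the list produced by energy_line.split().
words_example = ["@DF-RHF", "Final", "Energy:", "-154.09130176573018"]

# The last element is text that looks like a number, so float() can convert it.
energy_text = words_example[-1]
energy_value = float(energy_text)

# Arithmetic now works because energy_value is a float, not a string.
shifted = energy_value + 50

# The first element is not numeric, so float() raises a ValueError.
try:
    float(words_example[0])
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(type(energy_text), type(energy_value))
print(shifted)
print(conversion_failed)
```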

In [5]:
# @title Final Challenge
%%html
<style>
div.orange-alert {
    color: #854f00; /* Darker shade of orange for text */
    background-color: #ffe6cc; /* Light orange background */
    border-left: 5px solid #ff9933; /* Bright orange border */
    padding: 0.5em;
    font-size: 1.25em; /* Matches the surrounding text size */
    line-height: 1.5; /* Ensures readability */
}
div.orange-alert ul {
    margin: 0.5em 0; /* Space around the list */
}
div.orange-alert li {
    margin-bottom: 0.5em; /* Space between list items */
}
</style>
<div class="orange-alert">
    <strong>Exercise on File Parsing:</strong>
    <p>
        The file <code>03_Prod.mdout</code> is an output file from an Amber molecular dynamics simulation.
        Read in the file, and pull out all of the total energy values (<code>Etot</code>). Save the values in a list (don't forget to cast them to floating point numbers!).
    </p>

    <p>Remember to preview the file before you start writing code so you can come up with a strategy for pulling information from the file!</p>
</div>


In [None]:
# @title Key Points
%%html
<style>
div.green-note {
    color: #155724; /* Dark green for text */
    background-color: #d4edda; /* Light green background */
    border-left: 5px solid #28a745; /* Bright green border */
    padding: 0.5em;
    font-size: 1.25em; /* Consistent with text size */
    line-height: 1.5; /* Ensures readability */
    font-family: Arial, sans-serif; /* Clean and modern font */
}
div.green-note ul {
    margin: 0.5em 0; /* Space around the list */
}
div.green-note li {
    margin-bottom: 0.5em; /* Space between list items */
}
</style>

<div class="green-note">
    <strong>Key Points:</strong>
    <ul>
        <li>Files have locations called file paths that tell a computer where the file is saved.</li>
        <li>To open a file, use <code>with open(filename)</code>.</li>
        <li>Read the contents of a text file into a list of lines using <code>readlines</code>.</li>
        <li>You can search through the lines in your file by using a <code>for</code> loop to go through each line and an <code>if</code> statement to look for a pattern.</li>
    </ul>
</div>
