# PlottingSpectralData

Version 0.2 Mar 2020

This notebook combines some of the techniques from the Skills Week 19 notebooks to list, select and plot spectral data from a range of files downloaded from the NIST database, as described in Skills Week 19 Section 3.5.

https://webbook.nist.gov/

This notebook assumes that the data files are in a subfolder `C:\OU\SXPS288\DataFiles\ReferenceSpectra`.   Modify as necessary to point to the correct location on your computer.

The solution in the first set of cells in this notebook illustrates the process of putting together a complete application to select, read, process and plot the data in these files, working one cell at a time.  This step-by-step approach is helpful when developing a program - allowing you to check the output at each stage before going on to write the next part of your program.

The last cell in the notebook contains a complete solution in one cell. 


## 0 Imports

As usual, we will start by making the necessary imports.  

Remember to run this cell first, before any of the subsequent cells.

In [None]:
# Display Matplotlib plots inline in the notebook
%matplotlib inline

# glob for listing files
import glob

# ipywidgets for interactive controls
import ipywidgets as widgets
from IPython.display import display, clear_output

# Also import NumPy and pandas for handling data and Matplotlib for plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal

#Set the size of subsequent Matplotlib plots
plt.rcParams['figure.figsize'] = [12, 7.5]

## 1 Listing spectra files with `glob`

To start with, let's use `glob` to make a list of all the NIST spectra files in the chosen folder. 

In Windows, pathnames start with a drive letter. Macintosh pathnames work in a similar way, but without the drive letter.

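One quirk of Python is that the backslash character `\`, which Windows uses as the pathname separator, has a special meaning inside a string: it is the escape character, so a sequence such as `\t` is read as a tab rather than as a backslash followed by `t`.  The simplest way to specify a Windows pathname in Python is to use the forward slash character `/` instead, which Windows also accepts.  (Alternatives are to double each backslash, or to use a raw string such as `r"C:\OU"`.)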
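
As a small illustration of the escape-character issue, the sketch below writes the same Windows-style pathname in four ways. The folder name `C:\new\table` is made up for the example (it is chosen because `\n` and `\t` are both escape sequences), and `PureWindowsPath` from the standard `pathlib` module is used only to compare the results.

```python
from pathlib import PureWindowsPath

# In an ordinary string, "\n" and "\t" are escape sequences (newline and tab),
# so this hypothetical pathname is silently corrupted.
broken = "C:\new\table"

# Three ways of writing the pathname that the author intended
raw = r"C:\new\table"        # raw string: backslashes are kept literally
doubled = "C:\\new\\table"   # each backslash escaped with a second one
forward = "C:/new/table"     # forward slashes, as used in this notebook

print(repr(broken))          # shows the newline and tab characters
print(raw == doubled)        # the raw and doubled forms are the same string

# PureWindowsPath treats forward and backward slashes as equivalent
same = PureWindowsPath(raw) == PureWindowsPath(forward)
print(same)
```

The forward-slash form is used throughout this notebook because it needs no special treatment and also works on a Macintosh.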

This example supposes that you have a folder `C:\OU\SXPS288` set up to hold all of your work on SXPS288, and within that you have a folder structure for the different investigations, including a folder `C:\OU\SXPS288\DataFiles\ReferenceSpectra` containing some spectra files that you have downloaded.

(On a Macintosh, the corresponding pathname might be something like: `/Users/yourname/OU/SXPS288/DataFiles/ReferenceSpectra`)

The spectra files end with `.jdx`, so we can select only those files by including this extension in the search string.  `glob` is a convenient choice for this, since it allows you to specify a wildcard search string as part of the pathname.

In [None]:
# Glob allows you to add a wildcard to the pathname.  Only files matching the pattern will be listed.
# Modify the pathname in this example to match the location of the files on your computer

strPath = "C:/OU/SXPS288/DataFiles/ReferenceSpectra/"
nPathLen = len(strPath)

#Search for .jdx files.
dirlist = glob.glob(strPath + "*.jdx")

# Unlike os.listdir(), glob() returns the entire pathname
# To get just the file names, this loop removes the first part of the pathname before printing
# Try printing pathname instead to see the actual strings returned by glob()
for pathname in dirlist:
    filename = pathname[nPathLen:]
    print(filename)

## 2 Select file to plot using ipywidgets

Now we'll use a widget to list the files and allow the user to select one to plot.  This can be done using an `ipywidgets` dropdown box.

In [None]:
# Define the list of options to display
# Just want to show the filename, not the complete path, so chop off path name here
lstOptions = dirlist.copy()

for ix in range (len(dirlist)):
    lstOptions[ix] = dirlist[ix][nPathLen:]

def on_selection_change(change):
    with output_2:
        clear_output()
        print(f"Selected file: {drpFile.value}")

drpFile = widgets.Dropdown(options = lstOptions, description = "File:")

output_2 = widgets.Output()

display(drpFile, output_2)

# Call function each time the value is changed
drpFile.observe(on_selection_change, names = "value")

The next cell is simply a check that we can retrieve the selected value from the dropdown box.

In [None]:
strSelectedFile = strPath + drpFile.value
print(strSelectedFile)

## 3 Open file and plot contents

In order to open the data files and read the contents into a Pandas DataFrame ready for processing and plotting, we need to understand the file format and contents.  The best way to do this is to open the files in a text editor and inspect the contents visually. 

Do this now with some of the `.jdx` files that you have downloaded. 

You will find that the files start with a number of header lines beginning with "##", followed by the data, in the form of a number of columns.  There are no column headers, but the first column contains wavenumber values (in $\text{cm}^{-1}$), with the subsequent columns containing intensity values (typically 5 repeats).  Crucially, the number of header lines is not fixed - it can be different in different files, so our program needs to take this into account. 

Since the `.jdx` files have variable numbers of headers, we need to read the headers first and determine the number of lines to discard before the data starts.
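
The idea can be tried out on a few lines of text before working with a real file. In the sketch below, `sample_lines` is a made-up stand-in for the contents of a `.jdx` file (it is not real NIST data), and `count_header_lines()` is a hypothetical helper written for this illustration.

```python
# Made-up lines laid out like a .jdx file: headers, data rows, end marker
sample_lines = [
    "##TITLE=Example gas\n",
    "##JCAMP-DX=4.24\n",
    "##YUNITS=ABSORBANCE\n",
    "##XYDATA=(X++(Y..Y))\n",
    "450.0 0.01 0.02 0.03 0.02 0.01\n",
    "455.0 0.02 0.03 0.04 0.03 0.02\n",
    "##END=\n",
]

def count_header_lines(lines):
    """Return (number of header lines, title) for a list of lines."""
    n_header = 0
    title = ""
    for ix, line in enumerate(lines):
        if line.startswith("##TITLE="):
            # strip() removes the newline at the end of the line
            title = line[len("##TITLE="):].strip()
        # Every "##" line except the end marker counts as a header
        if line.startswith("##") and not line.startswith("##END"):
            n_header = ix + 1
    return n_header, title

n_header, title = count_header_lines(sample_lines)
print(f"{title}: {n_header} header lines")
```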


In [None]:
## first, open the file and read all the lines into a list
f = open(strSelectedFile)
lstLines = f.readlines()
f.close()

## The first task is to determine the number of header lines
## Now iterate through the list to find the last line beginning with "##" that is NOT "##END"

nLastHeaderLine = 0
strTitle = ""
strYlabel = ""

## Read through the file to find the number of header lines
## Some of the lines contain useful information, such as the name of the substance
## or the units, so we will make a note of these as they are found

for ix in range((len(lstLines))):
    if (lstLines[ix][:8] == "##TITLE="):
        # strip() removes the trailing newline so it does not appear in the title
        strTitle = lstLines[ix][8:].strip()
    if (lstLines[ix][:9] == "##YUNITS="):
        strYlabel = lstLines[ix][9:].strip()
    if (lstLines[ix][:2] == "##") and (lstLines[ix][:5] != "##END"):
        nLastHeaderLine = ix+1
        
print(f"File {strTitle} has {nLastHeaderLine} header lines")

## Now, can read dataset using Pandas
## The raw string r"\s+" stops Python treating "\s" as an (invalid) escape sequence
df_Spectrum = pd.read_csv(strSelectedFile, skiprows = nLastHeaderLine, delimiter = r"\s+", 
                          names = ["Wavenumber", "A", "B", "C", "D", "E"])

## Average the values in the five data columns to get a single averaged spectrum
df_Spectrum["Mean"] = df_Spectrum["A"]+df_Spectrum["B"]+df_Spectrum["C"]+df_Spectrum["D"]+df_Spectrum["E"]
df_Spectrum["Mean"] /= 5
df_Spectrum.head()
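
The averaging step can also be written with the pandas `mean()` method, which avoids typing each column name into a sum. The sketch below is an alternative to the cell above, run on two made-up rows of data rather than a real spectrum. It assumes that the file finishes with a `##END=` line, and uses the `comment` argument of `read_csv()` so that this line is ignored rather than read as a row of data.

```python
import io

import pandas as pd

# Two made-up data rows followed by the end marker (not real NIST data)
sample_text = (
    "450.0 1.0 2.0 3.0 4.0 5.0\n"
    "455.0 2.0 2.0 2.0 2.0 2.0\n"
    "##END=\n"
)

columns = ["Wavenumber", "A", "B", "C", "D", "E"]

# comment="#" tells pandas to ignore anything after a "#" character,
# so the "##END=" line is skipped altogether
df = pd.read_csv(io.StringIO(sample_text), sep=r"\s+", names=columns, comment="#")

# mean(axis=1) averages across the columns, giving one value per row
df["Mean"] = df[["A", "B", "C", "D", "E"]].mean(axis=1)

print(df)
```

One difference from the explicit sum is that `mean()` skips missing values by default, so a row with fewer than five intensity values is averaged over the values that are present.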
        

In [None]:
## Reindex and plot

df_Spectrum.set_index("Wavenumber", inplace = True, drop = False)

plt.plot(df_Spectrum["Wavenumber"], df_Spectrum["Mean"])
ax = plt.gca()
ax.set_xticks([0, 100, 200, 300, 400, 500, 600, 700])
plt.xlabel("Wavenumber (cm$^{-1}$)")
plt.ylabel(strYlabel)
plt.title(strTitle)
plt.show()

## 4 Complete solution
Having written and debugged the program in the form of individual cells, we can now put everything together into a complete program in a single cell.  We'll take the opportunity as we do this to make use of a dictionary, as suggested in step 4, so that the dropdown box can list the name of the gas (from the dataset header) rather than just the filename.

The program first uses `glob` to generate a list of files in the chosen location.  For each file in this list, the function `GetHeaderData()` reads the file and builds a list containing relevant information, such as the number of header lines, the title, the full filename, and the y axis label.  This list is returned and added to the dictionary, with the substance name as the key.  

The completed dictionary is then used to generate a dropdown box using `interact`.  When a value is selected, the corresponding list is passed to the `DisplaySpectrum()` function, which plots the labelled spectrum.

This solution illustrates the power of functions - all of the heavy lifting is carried out by the two functions, which can be independently developed and tested. The main program is very short and simply calls the relevant functions to do the work. 

The lists of header data and the dictionary form a _data structure_ that holds information about the spectrum in each file.  Very often, defining the data structure is as important a part of writing a program as working out the algorithmic steps.
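
To make the shape of this data structure concrete, the sketch below builds a small dictionary by hand. The titles, header counts and pathnames are invented for the illustration; in the complete solution they come from `GetHeaderData()`.

```python
# Each record is a list in the order: title, y-axis label,
# number of header lines, full pathname (all values are made up)
header_records = [
    ["Example gas A", "ABSORBANCE", 30, "spectra/a.jdx"],
    ["Example gas B", "TRANSMITTANCE", 28, "spectra/b.jdx"],
]

# Build the dictionary, using the title (first item) as the key
dict_file_info = {}
for record in header_records:
    dict_file_info[record[0]] = record

# Looking up a title returns the whole record for that file
selected = dict_file_info["Example gas B"]
title, y_label, n_header, path = selected

print(f"{title}: skip {n_header} lines of {path}, y axis is {y_label}")
```

Because the title is the key, two files with the same title would share one entry, with the later file replacing the earlier one.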

This cell can be run as a complete solution, without having to run any of the previous cells.

In [None]:
################################################
## Zone 0 - Imports
################################################

%matplotlib inline

# glob for listing files
import glob

# ipywidgets for interactive controls
import ipywidgets as widgets
from IPython.display import display, clear_output

# Also import NumPy and pandas for handling data and Matplotlib for plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal

#Set the size of subsequent Matplotlib plots
plt.rcParams['figure.figsize'] = [12, 7.5]

################################################
## Zone 1 - define functions
################################################

def GetHeaderData(strPathName):
    # define and clear variables
    nHeaderLines = 0
    strTitle = ""
    strYLabel = ""
    lstHeaderData = []
        
    # Open file and read lines
    f = open(strPathName)
    lstLines = f.readlines()
    f.close()
    
    for ix in range((len(lstLines))):
        if (lstLines[ix][:8] == "##TITLE="):
            strTitle = lstLines[ix][8:-1]
        if (lstLines[ix][:9] == "##YUNITS="):
            strYLabel = lstLines[ix][9:-1]
        if (lstLines[ix][:2] == "##") and (lstLines[ix][:5] != "##END"):
            nHeaderLines = ix+1
        
    lstHeaderData.append(strTitle)
    lstHeaderData.append(strYLabel)
    lstHeaderData.append(nHeaderLines)
    lstHeaderData.append(strPathName)
    
    return lstHeaderData

def DisplaySpectrum(Spectrum):
    # Extract values from the list
    strTitle = Spectrum[0]
    strYlabel = Spectrum[1]
    strSelectedFile = Spectrum[3]
    nLastHeaderLine = Spectrum[2]
    
    # Read data from the file
    # The raw string r"\s+" stops Python treating "\s" as an (invalid) escape sequence
    df_Spectrum = pd.read_csv(strSelectedFile, skiprows = nLastHeaderLine, delimiter = r"\s+", 
                          names = ["Wavenumber", "A", "B", "C", "D", "E"])
    
    ## Average the values in the five data columns to get a single averaged spectrum
    df_Spectrum["Mean"] = df_Spectrum["A"]+df_Spectrum["B"]+df_Spectrum["C"]+df_Spectrum["D"]+df_Spectrum["E"]
    df_Spectrum["Mean"] /= 5
    
    ## Reindex and plot
    df_Spectrum.set_index("Wavenumber", inplace = True, drop = False)
    
    plt.rcParams['figure.figsize'] = [12, 7.5]

    plt.plot(df_Spectrum["Wavenumber"], df_Spectrum["Mean"])
    ax = plt.gca()
    ax.set_xticks([0, 100, 200, 300, 400, 500, 600, 700])
    plt.xlabel("Wavenumber (cm$^{-1}$)")
    plt.ylabel(strYlabel)
    plt.title(strTitle)
    plt.show()

################################################
## Zone 2 - Define variables
################################################
strPath = "C:/OU/SXPS288/DataFiles/ReferenceSpectra/"
dictFileInfo = {}

################################################
## Zone 3 - Main body of program
################################################

## Search for .jdx files.
dirlist = glob.glob(strPath + "*.jdx")

## Iterate through list and get number of header lines, plus header information
## from each file.  This comes back in the form of a list that can be added to 
## the dictionary.
for pathname in dirlist:
    lstHeader = GetHeaderData(pathname)
    dictFileInfo[lstHeader[0]] = lstHeader
    
## Interactive widget to select and display spectrum
## Passing a dict to widgets.interact automatically 
## creates a dropdown box with the keys as the options
## and which returns the list corresponding to the selected key
widgets.interact(DisplaySpectrum, Spectrum = dictFileInfo)
