# 5b Data Pre-Processing Implementation

Recall that at the start of each new notebook you will need to import all libraries again.

✏️ Import the numpy and pandas libraries. In addition, import your newly created C317 library.

In [19]:
import numpy as np
import pandas as pd
import C317
from C317 import Narrow, Normalise, Interpolate

You now know how to import a .txt file into a DataFrame. In the `IR Data` folder, there are 380 text files, for `380/5=76` different chemicals, i.e. there are 5 repeated spectra for each compound.
<br>
<br>Writing out a new `pd.read_csv(...)` command for each of these 380 files in turn would be very time-consuming. It would be more useful to automate this process.

✏️ **Without explicitly writing it out,** create a list, 5 items long, containing the **file names** (strings) of the 5 repeats of m-anisaldehyde (`["m-anisaldehyde_1", "m-anisaldehyde_2", ...]`)

*Hint: This can be accomplished using a `for` loop. The `.append()` method of a list may well be useful. You may also find formatted strings or string addition helpful (refer back to Notebook 2 for more information).*

In [9]:
files=[f"m-anisaldehyde_{i}" for i in range(1,6)]
print(files)

['m-anisaldehyde_1', 'm-anisaldehyde_2', 'm-anisaldehyde_3', 'm-anisaldehyde_4', 'm-anisaldehyde_5']
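
For comparison, the list comprehension above can also be written with the explicit `for` loop and `.append()` suggested in the hint. This is only a sketch of one equivalent form, using string addition instead of a formatted string:

```python
# Build the list of file names one entry at a time.
files = []
for i in range(1, 6):  # range(1, 6) yields 1, 2, 3, 4, 5
    files.append("m-anisaldehyde_" + str(i))  # string addition instead of an f-string

print(files)
```

Both versions produce the same five strings; the comprehension is simply more compact.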


✏️ Adapt this code to make a list in which each entry is a **DataFrame** (using `read_csv()`) of the 5 repeats of m-anisaldehyde.

**Note:**
>For clarity, make the column name of each DataFrame the same as the file name (instead of "% Transmittance" as was used previously).

Use the `.head()` method to confirm that there are five different DataFrames contained in your list.

In [12]:
dataframes=[pd.read_csv(f"data/{file}.txt",skiprows=4,delimiter=r"\s+",names=[file],index_col=0) for file in files]
dataframes[0].head()


Unnamed: 0,m-anisaldehyde_1
399.826377,92.012424
400.183684,94.958885
400.540991,97.443201
400.898298,97.642822
401.255605,95.77448


We would now like to create a list with DataFrames for *all* of the (380) IR spectra in `IR Data`. Even with the increase in efficiency we have found above, this would be tedious.
<br>
<br>Python has a library called `os` (stands for 'operating system'), which lets you access the files on your computer. Rather than writing out all the names of the chemicals, we can get `os` to look in the folder where they're stored, and list them all.

✏️ Import the `os` library.

In [13]:
import os

The `os` library has a function called `scandir` which **scan**s a particular **dir**ectory (folder). It takes one argument: the directory you want to scan. For example if, in the folder that this Notebook is in, you had a folder called `Chemicals`, you could 'scan' it with `os.scandir("Chemicals")`.

✏️ Scan the directory with the IR data text files in. What does this return?

In [14]:
os.scandir('Data')

<nt.ScandirIterator at 0x1aeba065800>

The `scandir` function returns something called a `ScandirIterator`. The word 'iterator' might suggest to you that a for loop could be helpful.

✏️ Try iteratively printing the contents of the `ScandirIterator` object as output to see what happens.

In [15]:
for file in os.scandir('Data'):
    print(file)

<DirEntry 'm-anisaldehyde_1.txt'>
<DirEntry 'm-anisaldehyde_2.txt'>
<DirEntry 'm-anisaldehyde_3.txt'>
<DirEntry 'm-anisaldehyde_4.txt'>
<DirEntry 'm-anisaldehyde_5.txt'>


The output should give a series of `DirEntry`s, which seem to be all of the files in the directory/folder. So the `ScandirIterator` seems to contain the identities of all of the files in the directory. Each `DirEntry` object comes with an **attribute**, `name`.

✏️ Using a similar loop, generate a *list* of all the file names (you may want to use `.append()`). Print your list to confirm it has been created successfully.
<br>
<br>
**Note:**
>A `ScandirIterator` can only be looped through once. You will need to re-load the folder (with `scandir()`) before iterating through the files again.

In [16]:
filenames=[file.name for file in os.scandir('Data')]
print(filenames)

['m-anisaldehyde_1.txt', 'm-anisaldehyde_2.txt', 'm-anisaldehyde_3.txt', 'm-anisaldehyde_4.txt', 'm-anisaldehyde_5.txt']
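
The reason the folder has to be re-scanned, as the note above says, is that the object returned by `scandir()` is used up after one pass. The sketch below shows this on a temporary folder. The file names `a.txt` and `b.txt` are made up for the illustration and are not part of the IR data set.

```python
import os
import tempfile

# Create a temporary folder holding two empty, hypothetical files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.txt", "b.txt"):
        open(os.path.join(folder, name), "w").close()

    with os.scandir(folder) as entries:
        first_pass = sorted(entry.name for entry in entries)
        second_pass = [entry.name for entry in entries]  # the iterator is already exhausted

print(first_pass)   # ['a.txt', 'b.txt']
print(second_pass)  # []
```

Calling `os.scandir(folder)` again would give a fresh iterator, which is why the solutions here call it inside each loop.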


By reading through the contents of the list, you will hopefully appreciate why this method is worth using. The time taken to type out all of the file names manually and without error would be considerable.

✏️ Using your new list of all the .txt file names, create a list in which each entry is a separate DataFrame for each spectrum in `IR Data`. Print the fifth DataFrame in your list to confirm its contents.

In [18]:
IRData=[pd.read_csv(f"data/{file}",skiprows=4,delimiter=r"\s+",names=[file],index_col=0) for file in filenames]
IRData[0].head()

Unnamed: 0,m-anisaldehyde_1.txt
399.826377,92.012424
400.183684,94.958885
400.540991,97.443201
400.898298,97.642822
401.255605,95.77448


✏️ Now alter your code so that you pre-process the spectral data in each DataFrame, i.e. run it through the three functions in your library, before adding it to the list. Print the fifth DataFrame again, and make sure it's changed - it should start at 630 $cm^{-1}$ and have transmittance values much less than 1.

**Note:**
>Recall that the pre-processing functions should be applied in the order: normalise → interpolate → narrow.
>
>The import and processing of all spectral data will take a little time. Be patient.

In [69]:
IRData=[Narrow(Normalise(Interpolate(pd.read_csv(f"data/{file}",skiprows=4,delimiter=r"\s+",names=[file],index_col=0)))) for file in filenames]
IRData[0].head()


Unnamed: 0,m-anisaldehyde_1.txt
630,0.000264
631,0.000265
632,0.000262
633,0.000258
634,0.000255


In the previous notebook, you used the `concat()` function of the `pandas` library to join two DataFrames together into one long column, prior to sorting. It is also possible to join two DataFrames along the other axis to produce a two-column DataFrame. This is done by specifying the argument `axis=1`.
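
As a small illustration of the difference between the two axes, the sketch below joins two toy DataFrames both ways. The column names `spectrum_a` and `spectrum_b` and the values are invented for the example.

```python
import pandas as pd

# Two single-column DataFrames that share the same index (hypothetical values).
a = pd.DataFrame({"spectrum_a": [0.1, 0.2, 0.3]}, index=[630, 631, 632])
b = pd.DataFrame({"spectrum_b": [0.4, 0.5, 0.6]}, index=[630, 631, 632])

stacked = pd.concat([a, b])               # default axis=0: rows of b are placed below a
side_by_side = pd.concat([a, b], axis=1)  # axis=1: b becomes a second column

print(stacked.shape)       # (6, 2)
print(side_by_side.shape)  # (3, 2)
```

With `axis=1`, rows are matched on the index, so spectra that share the same wavenumber index line up row by row.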

✏️ From your list of DataFrames, create a single DataFrame containing all the spectra as individual columns. It should be 251 rows × 380 columns.

In [None]:
#concat() accepts the whole list, so all spectra can be joined in one call without emptying IRData
x=pd.concat(IRData,axis=1)
x.head()

     m-anisaldehyde_1.txt
630              0.000264
631              0.000265
632              0.000262
633              0.000258
634              0.000255
..                    ...
876              0.000238
877              0.000238
878              0.000237
879              0.000238
880              0.000239

[251 rows x 1 columns]


Unnamed: 0,m-anisaldehyde_1.txt,m-anisaldehyde_2.txt,m-anisaldehyde_3.txt,m-anisaldehyde_4.txt,m-anisaldehyde_5.txt
630,0.000264,0.000265,0.000264,0.000267,0.000265
631,0.000265,0.000266,0.000263,0.000265,0.000264
632,0.000262,0.000262,0.000263,0.000264,0.000263
633,0.000258,0.000258,0.000258,0.00026,0.000259
634,0.000255,0.000254,0.000256,0.000256,0.000258


✏️ Add the code you have just written into a function in your library, called `load_spectra()`. The function should return a DataFrame just like the one above. Call your function to check it works.

In [74]:
#This code reloads your library so that you can use the function you just added.
import importlib
importlib.reload(C317)
C317.load_spectra()

Unnamed: 0,m-anisaldehyde_1,m-anisaldehyde_2,m-anisaldehyde_3,m-anisaldehyde_4,m-anisaldehyde_5
630,0.000264,0.000265,0.000264,0.000267,0.000265
631,0.000265,0.000266,0.000263,0.000265,0.000264
632,0.000262,0.000262,0.000263,0.000264,0.000263
633,0.000258,0.000258,0.000258,0.000260,0.000259
634,0.000255,0.000254,0.000256,0.000256,0.000258
...,...,...,...,...,...
876,0.000238,0.000237,0.000239,0.000237,0.000235
877,0.000238,0.000235,0.000241,0.000240,0.000235
878,0.000237,0.000237,0.000244,0.000239,0.000239
879,0.000238,0.000239,0.000240,0.000239,0.000241
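
One possible shape for `load_spectra()` is sketched below. It is not the definitive library function and rests on several assumptions that are not in the original: the `folder` and `preprocess` parameters, the `wavenumber` index name, and sorting the files by name. Stripping `.txt` from the column names matches the output shown above. In your own library, `preprocess` would be replaced by your three pre-processing functions.

```python
import os
import pandas as pd

def load_spectra(folder="Data", preprocess=None):
    """Read every .txt spectrum in `folder` into one DataFrame, one column per file.

    `preprocess` is a hypothetical hook: any function that takes a DataFrame
    and returns a DataFrame (for example, your own pre-processing chain).
    """
    columns = []
    # scandir() gives no guaranteed order, so sort by file name (an assumption).
    for entry in sorted(os.scandir(folder), key=lambda e: e.name):
        if not entry.name.endswith(".txt"):
            continue  # skip anything that is not a spectrum file
        label = os.path.splitext(entry.name)[0]  # drop ".txt" from the column name
        spectrum = pd.read_csv(entry.path, skiprows=4, sep=r"\s+",
                               names=["wavenumber", label], index_col="wavenumber")
        if preprocess is not None:
            spectrum = preprocess(spectrum)
        columns.append(spectrum)
    # Join all spectra side by side, matching rows on the wavenumber index.
    return pd.concat(columns, axis=1)
```

Collecting the DataFrames in a list and calling `concat()` once at the end avoids rebuilding the combined DataFrame for every file.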


---