# Data Preprocessing

This notebook provides a procedure to pre-process channel 7 Diviner data collected between January 2010 and September 2023, as part of an effort to replicate the work published in [Unsupervised Learning for Thermophysical Analysis on the Lunar Surface](https://iopscience.iop.org/article/10.3847/PSJ/ab9a52) by Moseley et al. (2020).

A particular objective of this notebook is to run the pre-processing on only a standard computer (CPU, multi-threading) with additional storage space (~5 TB).

## Import Required Libraries

In [1]:
from diviner_tools import DivinerTools
from datetime import datetime, timedelta
import os
import pytz
import yaml

## Constants

In [2]:
# Pathway to config file
cfg_filepath = "/Notebooks/Moseley/diviner-tools/support/config/cfg.yaml"

with open(cfg_filepath, 'r') as file:
    cfg = yaml.safe_load(file)

# Filepath to database
DB_FILEPATH = cfg['database_filepath']

# Pathway to tmp directory
TMP_DIR = cfg['tmp_directory']

# Filepath to zip list
ZIP_FILEPATH = cfg['zip_filepath']

# Filepath to the useful tab files list
USEFUL_TABS_FILEPATH = cfg['useful_tabs_filepath']

# Batch size for number of .TAB files processed per iteration
BATCH_SIZE = cfg['batch_size']
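
Because every constant above comes straight from the YAML file, a missing key would only surface as a `KeyError` partway through the cell. The following is a minimal sketch of an up-front check, not part of `diviner_tools`. The key names are the ones read above; the helper `missing_config_keys` and the example values are hypothetical.

```python
# Keys this notebook reads from cfg.yaml
REQUIRED_KEYS = (
    "database_filepath",
    "tmp_directory",
    "zip_filepath",
    "useful_tabs_filepath",
    "batch_size",
)


def missing_config_keys(cfg):
    """Return the required keys that are absent from the loaded config dict."""
    return [key for key in REQUIRED_KEYS if key not in cfg]


# Hypothetical config values, for illustration only
example_cfg = {
    "database_filepath": "/data/diviner.db",
    "tmp_directory": "/data/tmp",
    "zip_filepath": "/data/zip_urls.txt",
    "batch_size": 10,
}

print(missing_config_keys(example_cfg))  # ['useful_tabs_filepath']
```

Running such a check right after `yaml.safe_load` would report every missing key at once instead of failing on the first one.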

## Init Diviner Tools

`diviner_tools` is a custom library developed specifically for this task. When a `DivinerTools` object is initialized, it creates the data directory and the database if they don't already exist.

In [3]:
dt = DivinerTools(DB_FILEPATH)
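
The "create if missing" behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual `DivinerTools` implementation: it assumes the database is SQLite, and both the function `ensure_database` and the table name `readings` are hypothetical.

```python
import os
import sqlite3
import tempfile


def ensure_database(db_filepath):
    """Create the parent directory and an SQLite database if they are missing.

    Assumption: the database is SQLite; the real library may differ.
    """
    parent = os.path.dirname(os.path.abspath(db_filepath))
    os.makedirs(parent, exist_ok=True)  # no error if the directory exists
    conn = sqlite3.connect(db_filepath)
    try:
        # Hypothetical table; IF NOT EXISTS makes repeated calls harmless
        conn.execute("CREATE TABLE IF NOT EXISTS readings (id INTEGER PRIMARY KEY)")
        conn.commit()
    finally:
        conn.close()
    return os.path.isfile(db_filepath)


# Demonstrate in a throwaway directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "data", "demo.db")
    created_first = ensure_database(demo_path)
    created_again = ensure_database(demo_path)  # second call is a no-op
    print(created_first, created_again)
```

Making the setup idempotent means the notebook can be re-run after an interruption without first checking what already exists on disk.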

## Preprocess

Preprocessing involves the following steps:
* Split the zip file URLs into batches
* For each URL, download the .zip file to a local directory
* Unpack the .zip file
* Read the lines from the unpacked .TAB file
* Check each line against the desired criteria (activity flag, geometry flag, etc.)
* If a line meets the criteria, write it to the database
* If a .TAB file contributed data to the database, save its filename to a text file
* Delete the .TAB file
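
The unpack, filter, and store steps above can be sketched as follows. This is a simplified illustration, not the code inside `dt.preprocess`: it assumes an SQLite database, the four-column layout and the criteria values are hypothetical stand-ins for the real .TAB fields, and the download and delete steps are skipped because the archive is built in memory.

```python
import io
import sqlite3
import zipfile

# Hypothetical column layout and criteria, for illustration only
COLUMNS = ("channel", "activity_flag", "geometry_flag", "tb")
CRITERIA = {"channel": 7, "activity_flag": 110, "geometry_flag": 12}


def parse_line(line):
    """Parse one comma-separated record; return None if it is not a data row."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != len(COLUMNS):
        return None
    try:
        values = [int(parts[0]), int(parts[1]), int(parts[2]), float(parts[3])]
    except ValueError:
        return None  # header, comment, or malformed line
    return dict(zip(COLUMNS, values))


def meets_criteria(record):
    """True when every field named in CRITERIA has the wanted value."""
    return all(record[key] == wanted for key, wanted in CRITERIA.items())


def process_zip(zip_bytes, conn):
    """Filter each .TAB member of an archive into the database.

    Returns the names of the .TAB files that contributed at least one row.
    """
    useful = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.upper().endswith(".TAB"):
                continue
            text = archive.read(name).decode("ascii", errors="replace")
            rows = []
            for line in text.splitlines():
                record = parse_line(line)
                if record is not None and meets_criteria(record):
                    rows.append(tuple(record[c] for c in COLUMNS))
            if rows:
                conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
                useful.append(name)
    conn.commit()
    return useful


# Build a tiny in-memory archive instead of downloading anything
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo.TAB", "# channel, af, qge, tb\n7, 110, 12, 95.5\n3, 110, 12, 80.0\n")
    archive.writestr("empty.TAB", "7, 0, 12, 90.0\n")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings "
    "(channel INTEGER, activity_flag INTEGER, geometry_flag INTEGER, tb REAL)"
)
useful_files = process_zip(buffer.getvalue(), conn)
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(useful_files, row_count)
```

Only `demo.TAB` contains a row that satisfies all three criteria, so it is the only filename returned; in the real workflow that list is what would be appended to the useful-tabs text file.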

In [4]:
all_urls = dt.txt_to_list(ZIP_FILEPATH)

In [5]:
batch = all_urls[11:21]
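
The slice above picks ten URLs by hand (indices 11 through 20). As a hedged sketch of how the whole list could instead be split using a fixed batch size such as `BATCH_SIZE`, the hypothetical helper `make_batches` below chunks any list; the example URLs are placeholders, not real Diviner links.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder URLs, for illustration only
urls = [f"https://example.org/file_{n}.zip" for n in range(25)]
batches = make_batches(urls, 10)

print([len(b) for b in batches])  # [10, 10, 5]
```

With this layout, `batches[1]` would cover indices 10 through 19, so a hand-picked slice like `[11:21]` does not line up with any single chunk; that matters when resuming a partially completed run.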

In [6]:
# Log start time
start_t = datetime.now(pytz.timezone('America/Montreal'))
print("Start time: " + repr(start_t.strftime('%Y-%m-%d %H:%M')))

Start time: '2023-12-31 02:17'


In [7]:
dt.preprocess(batch, TMP_DIR, USEFUL_TABS_FILEPATH)

There are 265 jobs left

In [8]:
# Log end time
end_t = datetime.now(pytz.timezone('America/Montreal'))
print("End time: " + repr(end_t.strftime('%Y-%m-%d %H:%M')))

# Total elapsed time
delta_t = end_t - start_t

# Calculate total seconds in the timedelta
total_seconds = int(delta_t.total_seconds())

# Extract hours, minutes, and seconds
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

# Format the output as HH:mm:ss
formatted_time = f"{hours:02}:{minutes:02}:{seconds:02}"

print("Elapsed time: " + formatted_time)

End time: '2023-12-31 02:58'
Elapsed time: 00:40:35
Job monitor stopped
