# Data Preprocessing

This notebook pre-processes channel 7 Diviner data collected between January 2010 and September 2023. It is part of an effort to replicate the work published in [Unsupervised Learning for Thermophysical Analysis on the Lunar Surface](https://iopscience.iop.org/article/10.3847/PSJ/ab9a52) by Moseley et al. (2020).

A particular objective of this notebook is to do the pre-processing on a standard computer (CPU with multi-threading) with expanded storage space (~5 TB).

## Import Required Libraries

In [None]:
from diviner_tools import DivinerTools

## Constants

In [None]:
# Path to the config file
CFG_FILEPATH = "/Notebooks/Moseley/diviner-tools/support/config/cfg.yaml"

# Path to the pre-collected list of zip file URLs
ZIP_FILEPATH = "/esthar/diviner_data/txt_files/zip_urls.txt"

## Init Diviner Tools

`diviner_tools` is a custom library developed specifically for this task. When a `DivinerTools` object is initialized, it creates the data directory and database if they don't already exist.

In [None]:
dt = DivinerTools(CFG_FILEPATH)

## Preprocess

Preprocessing involves the following steps:
* Split the zip file URLs into batches
* For each URL, download the .zip file to a local directory
* Unpack the .zip file
* Read the lines from the unpacked .TAB file
* Check each line against the desired criteria (activity flag, geometry flag, etc.)
* If a line meets the desired criteria, write it to our database
* If a .TAB file contains data that was written to the database, save its filename to a text file
* Delete the .TAB file
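
The unpack, read, filter, and write steps above can be sketched with the standard library alone. This is only an illustration, not the `diviner_tools` implementation: the four-column row layout, the `readings` table, the flag values, and the names `keep_row` and `process_zip` are all hypothetical. The download step is left out so the sketch runs offline, and an in-memory SQLite database stands in for the real one.

```python
import io
import sqlite3
import zipfile

# Placeholder criteria, for illustration only. The real flag values
# and column positions are defined inside diviner_tools.
WANTED_ACTIVITY = "110"
WANTED_GEOMETRY = "12"


def keep_row(fields):
    """Return True if a parsed row passes the (assumed) flag checks."""
    return fields[1] == WANTED_ACTIVITY and fields[2] == WANTED_GEOMETRY


def process_zip(zip_bytes, conn):
    """Unpack an in-memory .zip, filter its .TAB lines, store the matches.

    Returns the names of the .TAB files that contributed at least one row.
    """
    useful_files = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.upper().endswith(".TAB"):
                continue
            rows = []
            with archive.open(name) as tab:
                for line in io.TextIOWrapper(tab, encoding="ascii"):
                    fields = [f.strip() for f in line.split(",")]
                    if len(fields) == 4 and keep_row(fields):
                        rows.append(fields)
            if rows:
                conn.executemany(
                    "INSERT INTO readings VALUES (?, ?, ?, ?)", rows
                )
                conn.commit()
                useful_files.append(name)
    return useful_files


# Build a tiny archive in memory so the sketch needs no network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_archive:
    demo_archive.writestr(
        "sample.TAB",
        "2010-01-01,110,12,95.2\n2010-01-01,120,12,96.0\n",
    )

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE readings (date TEXT, af TEXT, geom TEXT, tb TEXT)"
)
kept_files = process_zip(buffer.getvalue(), demo_conn)
stored = demo_conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Only one of the two sample rows passes the placeholder criteria, so one row is stored and `sample.TAB` is reported as a file that contributed data.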

Because there is a lot of data to process, which may take a long time, we split the 717,509 URLs into master batches of 100,000 each (eight batches, the last holding the remaining 17,509 URLs) and start each master batch manually.

In [None]:
all_urls = dt.txt_to_list(ZIP_FILEPATH)

In [None]:
# Master batches
master_batches = dt.batch(all_urls, 100000)
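
The source of `dt.batch` is not shown here. As a rough sketch, assuming it cuts the list into consecutive slices, a helper like the hypothetical `batch_sketch` below reproduces the batch count for 717,509 URLs. The URL strings are stand-ins, not real Diviner links.

```python
def batch_sketch(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Stand-in for the real URL list: same length, dummy contents.
demo_urls = [f"url_{i}" for i in range(717_509)]
demo_batches = batch_sketch(demo_urls, 100_000)

print(len(demo_batches))      # 8
print(len(demo_batches[-1]))  # 17509
```

Seven full batches plus one partial batch give the indices 0 through 7 used in the preprocessing loop.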

### Preprocessing Loop

We will start each master batch manually. This could be automated, but working piece by piece limits the damage if something interrupts a batch. The database is backed up manually between master batch sessions.

In [None]:
# Run one call per session and back up the database before starting the next
dt.preprocess(master_batches[0])
dt.preprocess(master_batches[1])
dt.preprocess(master_batches[2])
dt.preprocess(master_batches[3])
dt.preprocess(master_batches[4])
dt.preprocess(master_batches[5])
dt.preprocess(master_batches[6])
dt.preprocess(master_batches[7])
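
The backup between master batch sessions is done by hand in this notebook. If the database happens to be a SQLite file (an assumption, since the database engine is not stated here), one possible way to script it is Python's built-in `sqlite3` backup API. The file names and the `backup_database` helper below are hypothetical, and the demo runs in a temporary directory.

```python
import os
import sqlite3
import tempfile


def backup_database(src_path, dest_path):
    """Copy a SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()


# Demo with a throwaway database; the paths are illustrative only.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "diviner.db")
demo_db = sqlite3.connect(db_path)
demo_db.execute("CREATE TABLE readings (tb REAL)")
demo_db.execute("INSERT INTO readings VALUES (95.2)")
demo_db.commit()
demo_db.close()

backup_path = os.path.join(workdir, "diviner_after_batch_0.db")
backup_database(db_path, backup_path)
```

Naming each copy after the master batch that just finished makes it easy to roll back to the last good state if a later session is interrupted.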