# Data Preprocessing

This notebook provides a procedure for pre-processing channel 7 Diviner data collected between January 2010 and September 2023, as part of an effort to replicate the work published in [Unsupervised Learning for Thermophysical Analysis on the Lunar Surface](https://iopscience.iop.org/article/10.3847/PSJ/ab9a52) by Moseley et al. (2020).

A specific objective of this notebook is to run the pre-processing on a standard computer (CPU only, multi-threaded) with expanded storage space (~5 TB).

## Import Required Libraries

In [1]:
from diviner_tools import DivinerTools
from datetime import datetime
import os

## Constants

In [2]:
# The parent directory for relevant data products
# Note: Esthar has 10TB
DATA_DIR = '/esthar/diviner_data'

# The name for our database object
DB_NAME = 'diviner_data.db'

# The filepath for the database
DB_FILEPATH = os.path.join(DATA_DIR, "database", DB_NAME)

# The filepath where tab files will be temporarily saved
TAB_DIR = os.path.join(DATA_DIR, "tab_files")

# Filepath for a text file with all .zip file links
ZIP_URLS_FILE = os.path.join(DATA_DIR, "txt_files", "zip_urls.txt")

# Filepath for a text file with .TAB filenames that contain target data
USEFUL_TAB_FILE = os.path.join(DATA_DIR, "txt_files", "useful_tabs.txt")

## Init Diviner Tools

diviner_tools is a custom library developed specifically for this task. When a DivinerTools object is initialized, it creates the data directory and database if they don't already exist.

In [3]:
dt = DivinerTools(DATA_DIR, DB_FILEPATH)

## Extract Zip URLs

We extract the URLs of the zip files whose .TAB files hold the RDR LVL1 tables. This is a somewhat slow process, so we do it once and save the URLs to a text file. Skip this step if ZIP_URLS_FILE already exists and has been populated. We expect 717,509 URLs.

Each year takes approximately 30-45 seconds, so total time should be around 5-10 minutes.

In [None]:
zip_urls = dt.find_all_zip_urls()

print("Found " + repr(len(zip_urls)) + " .zip file urls")

In [None]:
# Save urls to file
append_to_file(ZIP_URLS_FILE, zip_urls)
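
The cell above calls append_to_file, which is not defined in the cells shown here; it may be provided elsewhere (for example by diviner_tools). If it is not, a minimal stand-in might look like the sketch below. Both function names and their behavior are assumptions made for illustration, not the library's actual API. The second helper shows one way to decide whether the extraction step can be skipped.

```python
import os


def append_to_file(filepath, lines):
    """Hypothetical stand-in: append each item to a text file, one per line."""
    parent = os.path.dirname(filepath)
    if parent:
        # Create the parent directory (e.g. txt_files/) if it is missing
        os.makedirs(parent, exist_ok=True)
    with open(filepath, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(f"{line}\n")


def url_file_is_populated(filepath):
    """Hypothetical check: True if the file exists and is not empty."""
    return os.path.isfile(filepath) and os.path.getsize(filepath) > 0
```

With these in place, the extraction cells could be guarded with `if not url_file_is_populated(ZIP_URLS_FILE):` so that the slow URL search only runs once.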

## Preprocess

Preprocessing involves:
* Splitting the zip file URLs into batches
* Downloading the .zip file for each URL to a local directory
* Unpacking the .zip file
* Reading the lines from the unpacked .TAB file
* Checking each line against the desired criteria (activity flag, geometry flag, etc.)
* Writing each line that meets the criteria to our database
* Saving the filename to a text file if the .TAB file contributed data to the database
* Deleting the .TAB file
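
The unpack, filter, and store steps above can be sketched as follows. This is not the implementation of dt.processor, which is not shown in this notebook. The column names (channel, activity_flag, geometry_flag, tb), the comma-separated layout, the readings table, and the flag values are all placeholders chosen for illustration; the real RDR table format and selection criteria may differ. The download and on-disk cleanup steps are omitted by working on an in-memory zip.

```python
import csv
import io
import sqlite3
import zipfile


def process_zip(zip_bytes, conn, wanted):
    """Unpack an in-memory zip, filter rows of each .TAB file, store matches.

    `wanted` maps placeholder column names to the values a row must have.
    Returns the names of .TAB files that contributed at least one row.
    """
    useful = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.upper().endswith(".TAB"):
                continue
            text = zf.read(name).decode("utf-8")
            reader = csv.DictReader(io.StringIO(text))
            # Keep only rows that satisfy every criterion
            rows = [
                (r["channel"], r["tb"])
                for r in reader
                if all(r[key] == value for key, value in wanted.items())
            ]
            if rows:
                conn.executemany(
                    "INSERT INTO readings (channel, tb) VALUES (?, ?)", rows
                )
                conn.commit()
                useful.append(name)
    return useful


# Build a tiny zip in memory with made-up rows to exercise the function
buf = io.BytesIO()
header = "channel,activity_flag,geometry_flag,tb\n"
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("sample.TAB", header + "7,A,G,250.1\n3,A,G,180.0\n7,X,G,99.9\n")
    zf.writestr("empty.TAB", header + "3,A,G,180.0\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (channel TEXT, tb TEXT)")
criteria = {"channel": "7", "activity_flag": "A", "geometry_flag": "G"}
useful = process_zip(buf.getvalue(), conn, criteria)
```

Here only the first row of sample.TAB satisfies all three placeholder criteria, so `useful` contains that one filename and the table receives a single row.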

In [4]:
tmp = dt.txt_to_list(ZIP_URLS_FILE)

In [5]:
batch = tmp[8:10]
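
The slice above selects two URLs for a trial run. For the full list, the URLs could be split into fixed-size batches with a small helper such as the one below. The name make_batches and the batch size are assumptions for illustration; they are not part of diviner_tools.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Example with made-up data: five items in batches of two
example_batches = make_batches(["u0", "u1", "u2", "u3", "u4"], 2)
```

Each batch can then be passed through the thread pool in turn, which keeps the number of temporary .TAB files on disk bounded.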

In [6]:
dt.startDatabaseJobMonitor(DB_FILEPATH)

In [7]:
import concurrent.futures

start_t = datetime.now()

with concurrent.futures.ThreadPoolExecutor() as executor:

    futures = [executor.submit(dt.processor, url, TAB_DIR, DB_FILEPATH, USEFUL_TAB_FILE) for url in batch]

    # Wait for all futures to complete 
    results = [future.result() for future in concurrent.futures.as_completed(futures)]

end_t = datetime.now()

print("Time to complete: " + repr(end_t - start_t))

Time to complete: datetime.timedelta(seconds=220, microseconds=633145)
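
In the cell above, future.result() re-raises any exception thrown inside a worker, so a single failed download would abort the whole list comprehension. A more forgiving variant, sketched below, records failures per URL so they can be retried later. The helper name run_batch is an assumption for illustration, and the worker is passed in as a plain callable.

```python
import concurrent.futures


def run_batch(worker, urls, max_workers=None):
    """Run worker(url) for each URL in a thread pool.

    Returns (succeeded, failed), where failed maps each URL to the
    exception its worker raised.
    """
    succeeded, failed = [], {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(worker, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                future.result()
                succeeded.append(url)
            except Exception as exc:
                # Keep going; remember which URL failed and why
                failed[url] = exc
    return succeeded, failed
```

In this notebook it could be called as `run_batch(lambda url: dt.processor(url, TAB_DIR, DB_FILEPATH, USEFUL_TAB_FILE), batch)`.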


In [8]:
dt.stopDatabaseJobMonitor()

There are 370 jobs left
Job monitor stopped
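
The internals of the job monitor are not shown in this notebook. The output suggests that database writes are queued as jobs and drained in the background. One common pattern for this, since SQLite allows only one writer at a time, is a single writer thread fed by a queue. The sketch below illustrates that pattern under those assumptions; the class QueuedWriter and its methods are hypothetical and are not the diviner_tools implementation.

```python
import queue
import sqlite3
import threading


class QueuedWriter:
    """Hypothetical single-writer pattern: many producers, one SQLite writer."""

    def __init__(self, db_path):
        self._db_path = db_path
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def submit(self, sql, params=()):
        """Queue one statement; safe to call from any thread."""
        self._jobs.put((sql, params))

    def jobs_left(self):
        """Approximate number of queued jobs not yet picked up."""
        return self._jobs.qsize()

    def stop(self):
        """Drain all queued jobs, then stop the writer thread."""
        self._jobs.put(None)  # sentinel placed after all pending jobs
        self._thread.join()

    def _run(self):
        # The connection is created and used only in the writer thread
        conn = sqlite3.connect(self._db_path)
        try:
            while True:
                job = self._jobs.get()
                if job is None:
                    break
                sql, params = job
                conn.execute(sql, params)
                conn.commit()
        finally:
            conn.close()
```

Because the sentinel is queued behind any pending jobs, calling stop() waits for the remaining writes to finish before returning.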
