# Data Preprocessing

This notebook provides a procedure for pre-processing channel 7 Diviner data collected between January 2010 and September 2023, with the goal of replicating the work published in [Unsupervised Learning for Thermophysical Analysis on the Lunar Surface](https://iopscience.iop.org/article/10.3847/PSJ/ab9a52) by Moseley et al. (2020).

A particular objective of this notebook is to run the pre-processing on a standard computer (CPU, multi-threading) with augmented storage space (~5 TB).

## Import Required Libraries

In [6]:
from diviner_tools import DivinerTools
from datetime import datetime
import os

## Constants

In [2]:
# The parent directory for relevant data products
# Note: Esthar has 10TB
DATA_DIR = '/esthar/diviner_data'

# The name for our database object
DB_NAME = 'diviner_data.db'

# The filepath for the database
DB_FILEPATH = os.path.join(DATA_DIR, "database", DB_NAME)

# The filepath where tab files will be temporarily saved
TAB_DIR = os.path.join(DATA_DIR, "tab_files")

# Filepath for a text file with all .zip file links
ZIP_URLS_FILE = os.path.join(DATA_DIR, "txt_files", "zip_urls.txt")

# Filepath for a text file with .TAB filenames that contain target data
USEFUL_TAB_FILE = os.path.join(DATA_DIR, "txt_files", "useful_tabs.txt")

## Init Diviner Tools

On initialization, the `DivinerTools` object creates the data directory and the database if they do not already exist.

In [7]:
dt = DivinerTools(DATA_DIR, DB_FILEPATH)
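
The internals of `DivinerTools` are not shown in this notebook. As a rough sketch of what such an initialization step could look like, the helper below creates the directories and an empty database file. The name `ensure_storage` is hypothetical, and the use of SQLite is an assumption based only on the `.db` file extension.

```python
import os
import sqlite3


def ensure_storage(data_dir, db_filepath):
    """Create the data directory and an empty SQLite database if missing.

    Hypothetical helper for illustration; not part of diviner_tools.
    Returns True if the database file exists afterwards.
    """
    os.makedirs(data_dir, exist_ok=True)

    # The database may live in a subdirectory (e.g. DATA_DIR/database)
    db_dir = os.path.dirname(db_filepath)
    if db_dir:
        os.makedirs(db_dir, exist_ok=True)

    # sqlite3.connect creates the file when it does not exist yet
    conn = sqlite3.connect(db_filepath)
    conn.close()
    return os.path.exists(db_filepath)
```

Because `os.makedirs(..., exist_ok=True)` and `sqlite3.connect` both tolerate existing targets, calling this repeatedly should be harmless.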

## Extract Zip URLs

We will extract the URLs of the zip files whose tab files contain the RDR LVL1 tables. This can be a somewhat slow process, so we will do it once and then save the URLs to a text file.

Each year takes approximately 30-45 seconds, so the 14 years from 2010 through 2023 should take roughly 7-10 minutes in total.

In [None]:
zip_urls = dt.find_all_zip_urls()

print(f"Found {len(zip_urls)} .zip file urls")

In [None]:
# Save urls to file
append_to_file(ZIP_URLS_FILE, zip_urls)
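
The helper `append_to_file` is not defined in this notebook. A minimal sketch of an equivalent save/load pair is given below; the names `append_lines` and `read_lines` are hypothetical, and the one-URL-per-line file layout is an assumption.

```python
import os


def append_lines(filepath, lines):
    """Append each string in `lines` to `filepath`, one per line."""
    parent = os.path.dirname(filepath)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(filepath, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def read_lines(filepath):
    """Return the non-empty lines of `filepath` as a list of strings."""
    with open(filepath, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Since the file is opened in append mode, running the save step twice would duplicate the URLs, so it may be worth deleting or de-duplicating the file before a rerun.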

## Download, Extract, Filter, and Delete

We will download zip files in small batches, extract the tab files, delete the original zip file, filter the data, write the filtered data to a database, and then delete the tab file. The filenames of tab files that contain data passing the filter will be saved so that, if we need to redo this process, we can focus on the files that contain relevant data.
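
To make the batching and the extract-then-delete step concrete, here is a minimal sketch using only the standard library. The function names `batches` and `unpack_and_delete` are hypothetical and do not reflect the actual `diviner_tools` implementation; the download step is left out so the example runs without network access.

```python
import os
import zipfile


def batches(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def unpack_and_delete(zip_path, dest_dir):
    """Extract .TAB members from a zip archive, then delete the archive.

    Returns the list of extracted file paths.
    """
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Only the tab files are needed; skip any other members
            if name.upper().endswith(".TAB"):
                extracted.append(zf.extract(name, dest_dir))
    # Free disk space as soon as the archive has been unpacked
    os.remove(zip_path)
    return extracted
```

Deleting each archive immediately after extraction keeps peak disk usage close to one batch of tab files rather than the full archive set.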

In [4]:
tmp = dt.txt_to_list(ZIP_URLS_FILE)

In [5]:
tmp[0:10]

['https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/2010/201001/20100101/201001010000_rdr.zip',
 'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/2010/201001/20100101/201001010010_rdr.zip',
 'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/2010/201001/20100101/201001010020_rdr.zip',
 'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/2010/201001/20100101/201001010030_rdr.zip',
 'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/2010/201001/20100101/201001010040_rdr.zip',
 'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/2010/201001/20100101/201001010050_rdr.zip',
 'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/2010/201001/20100101/201001010100_rdr.zip',
 'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/2010/201001/20100101/201001010110_rdr.zip',
 'https://pds-geosciences.wustl.

## Scratch Code

In [None]:
lines = tab_to_lines('/esthar/diviner_data/201001010000_RDR.TAB')

In [None]:
print(len(lines))

In [None]:
count = 0

# Sum the values returned by each insert call
for line in lines:
    count += insert_into_database(DB_FILEPATH, line)

print(f"{count} lines added to database")

In [None]:
lines = tab_to_lines('/esthar/diviner_data/201001010010_RDR.TAB')

In [None]:
print(len(lines))

In [None]:
count = 0

# Sum the values returned by each insert call
for line in lines:
    count += insert_into_database(DB_FILEPATH, line)

print(f"{count} lines added to database")

## 2010

First, find all the .zip file URLs for the year 2010.

In [None]:
zip_urls = find_all_zip_urls('2010')

print(f"Found {len(zip_urls)} .zip file urls")

In [None]:
append_to_file(ZIP_URLS_FILE, zip_urls)

Download the zip files, unpack the .TAB file, and then delete the original zip file. Note that each .TAB file seems to be about 289 MB. Downloading all 51,483 .TAB files from 2010 alone would require just over 14 TB of storage.
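
The storage estimate can be checked with a few lines of arithmetic. The per-file size and file count are taken from the text; whether the reported size is in decimal or binary units is not stated, so both conversions are shown.

```python
TAB_SIZE_MB = 289      # approximate size of one .TAB file, from the text
N_TAB_FILES = 51_483   # number of .TAB files for 2010, from the text

total_mb = TAB_SIZE_MB * N_TAB_FILES
total_tb_decimal = total_mb / 1000**2   # 1 TB = 10^6 MB
total_tib_binary = total_mb / 1024**2   # 1 TiB = 2^20 MiB

print(f"{total_tb_decimal:.1f} TB (decimal), {total_tib_binary:.1f} TiB (binary)")
```

This gives about 14.9 TB under the decimal convention and about 14.2 TiB under the binary one, which is why the tab files have to be filtered and deleted as they are processed rather than kept on disk.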

In [None]:
download_unpack_delete(DATA_DIR, zip_urls[1])

Insert the .TAB data into the local database, then delete the original .TAB file.

In [None]:
# TBD
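
Until this step is implemented, the following is one possible shape for it: a bulk insert with `sqlite3.executemany` inside a single transaction. The function name `insert_rows`, the table name `rdr`, and the four-column schema are hypothetical placeholders, not the actual Diviner RDR layout or the `diviner_tools` schema.

```python
import sqlite3


def insert_rows(db_filepath, rows):
    """Bulk-insert (clat, clon, channel, tb) tuples; returns rows inserted.

    The table name and four-column schema are hypothetical placeholders.
    """
    rows = list(rows)
    conn = sqlite3.connect(db_filepath)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS rdr "
                "(clat REAL, clon REAL, channel INTEGER, tb REAL)"
            )
            conn.executemany("INSERT INTO rdr VALUES (?, ?, ?, ?)", rows)
    finally:
        conn.close()
    return len(rows)
```

Inserting many rows in one transaction is generally much faster in SQLite than committing after every row, which matters when each tab file holds a large number of lines.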

Delete data points that don't meet the parameter criteria.

In [None]:
# TBD
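
As a placeholder for this step, the sketch below deletes rows with a parameterized SQL `DELETE`. It assumes a hypothetical SQLite table `rdr` with `channel` and `tb` columns, and the temperature bounds passed in are purely illustrative; the actual filtering criteria are not specified in this notebook.

```python
import sqlite3


def delete_unwanted(db_filepath, channel, tb_min, tb_max):
    """Delete rows outside the target channel or temperature range.

    Assumes a hypothetical table `rdr` with `channel` and `tb` columns.
    Returns the number of rows deleted.
    """
    conn = sqlite3.connect(db_filepath)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "DELETE FROM rdr WHERE channel != ? OR tb < ? OR tb > ?",
                (channel, tb_min, tb_max),
            )
            deleted = cur.rowcount
    finally:
        conn.close()
    return deleted
```

Applying the same criteria before insertion instead would avoid writing rows that are later removed; deleting afterwards is shown here only because it mirrors the order of the steps above.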