# Data Preprocessing

This notebook provides a procedure to pre-process channel 7 Diviner data collected between January 2010 and September 2023, with the goal of replicating the work published in [Unsupervised Learning for Thermophysical Analysis on the Lunar Surface](https://iopscience.iop.org/article/10.3847/PSJ/ab9a52) by Moseley et al. (2020).

A particular objective of this notebook is to run the pre-processing on a standard computer (CPU, multi-threading) with augmented storage space (~5 TB).

#### Import Required Libraries

In [1]:
from diviner_tools import DivinerTools

#### Constants

In [2]:
# Pathway to config file
CFG_FILEPATH = "/Notebooks/Moseley/diviner-tools/support/config/cfg.yaml"

# Pathway to pre-collected zip file URLs list
ZIP_FILEPATH = "/esthar/diviner_data/txt_files/zip_urls.txt"

#### Init Diviner Tools

`diviner_tools` is a custom library developed specifically for this task. Initializing the `DivinerTools` object creates the data directory and database if they don't already exist.

In [3]:
dt = DivinerTools(CFG_FILEPATH)

## Preprocess

Preprocessing involves:
* Splitting the zip file URLs into batches
* For each URL, downloading the .zip file to a local directory
* Unpacking the .zip file
* Reading the lines from the unpacked .TAB file
* Checking each line against the desired criteria (activity flag, geometry flag, etc.)
* Writing each line that meets the criteria to our database
* If a .TAB file contributed data to the database, saving its filename to a text file
* Deleting the .TAB file
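
The per-file steps in this list can be sketched with the standard library alone. This is a minimal illustration, not the `diviner_tools` implementation: `download`, `process_zip`, and the `keep_line` predicate are hypothetical names, and the database write is replaced by returning the kept lines.

```python
import os
import urllib.request
import zipfile


def download(url, dest_dir):
    """Download one archive into dest_dir and return its local path."""
    dest = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, dest)
    return dest


def process_zip(zip_path, keep_line, workdir):
    """Unpack an archive and return the .TAB lines accepted by keep_line.

    keep_line is a hypothetical predicate standing in for the criteria
    check (activity flag, geometry flag, etc.).
    """
    kept = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.upper().endswith(".TAB"):
                continue
            # Unpack a single .TAB member into the working directory
            tab_path = archive.extract(name, path=workdir)
            try:
                with open(tab_path, "r") as tab_file:
                    for line in tab_file:
                        if keep_line(line):
                            kept.append(line.rstrip("\n"))
            finally:
                # Delete the unpacked .TAB file to free disk space
                os.remove(tab_path)
    return kept
```

Deleting each .TAB file in a `finally` block keeps the working directory small even if a file fails to parse, which matters when storage is the limiting resource.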

Because there is a large amount of data to process, which may take a long time, we split the 717,509 URLs into master batches of 100,000 each (eight batches, the last one partial) and start each master batch manually.
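
As a rough sketch of the batching step, the URL list can be cut into consecutive slices. The helper `split_into_batches` and the placeholder URLs are assumptions for illustration; the real list would be read from `ZIP_FILEPATH`.

```python
def split_into_batches(items, batch_size):
    """Return consecutive slices of items, each at most batch_size long."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder URLs standing in for the real list (assumption)
urls = [f"url_{i}" for i in range(717_509)]
batches = split_into_batches(urls, 100_000)

print(len(batches))      # 8
print(len(batches[-1]))  # 17509
```

Seven batches are full and the eighth holds the remaining 17,509 URLs.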

In [4]:
lines = dt.tab_to_lines("/esthar/diviner_data/tmp/201001010120_RDR.TAB")

In [11]:
count = 0
ok_lines = []

for line in lines:
    # Split the line and strip whitespace from each value
    values = [val.strip() for val in line.strip().split(',')]

    # Check that the data conforms to desired params
    if dt.check_params(values):
        ok_lines.append(line)
        count += 1

print(count)

98448


In [105]:
from enum import Enum

# Enum for data fields
FIELD = Enum("FIELD",
    ["DATE", "UTC", "JDATE", "ORBIT", "SUNDIST",
     "SUNLAT", "SUNLON", "SCLK", "SCLAT", "SCLON",
     "SCRAD", "SCALT", "EL_CMD", "AZ_CMD", "AF",
     "ORIENTLAT", "ORIENTATION", "C", "DET", "VLOOKX",
     "VLOOKY", "VLOOKZ", "RADIANCE", "TB", "CLAT",
     "CLON", "CEMIS", "CSUNZEN", "CSUNAZI", "CLOCTIME",
     "QCA", "QGE", "QMI"],
    start=0)

FIELD_LIST = list(FIELD)

tmp = ok_lines[0]

# Split the line    
values = tmp.strip().split(',')

# Remove any whitespaces from the values
values = [val.strip() for val in values]

# Print each field name next to its raw string value
for field, val in zip(FIELD_LIST, values):
    print(f"{field.name!r}: {val}")

# Convert each raw string to its numeric type (DATE and UTC stay as strings)
job_values = [
    values[FIELD.DATE.value], values[FIELD.UTC.value], float(values[FIELD.JDATE.value]),
    float(values[FIELD.ORBIT.value]), float(values[FIELD.SUNDIST.value]), float(values[FIELD.SUNLAT.value]),
    float(values[FIELD.SUNLON.value]), float(values[FIELD.SCLK.value]), float(values[FIELD.SCLAT.value]),
    float(values[FIELD.SCLON.value]), float(values[FIELD.SCRAD.value]), float(values[FIELD.SCALT.value]),
    float(values[FIELD.EL_CMD.value]), float(values[FIELD.AZ_CMD.value]), float(values[FIELD.AF.value]),
    float(values[FIELD.ORIENTLAT.value]), float(values[FIELD.ORIENTATION.value]), float(values[FIELD.C.value]),
    int(values[FIELD.DET.value]), float(values[FIELD.VLOOKX.value]), float(values[FIELD.VLOOKY.value]),
    float(values[FIELD.VLOOKZ.value]), float(values[FIELD.RADIANCE.value]), float(values[FIELD.TB.value]),
    float(values[FIELD.CLAT.value]), float(values[FIELD.CLON.value]), float(values[FIELD.CEMIS.value]),
    float(values[FIELD.CSUNZEN.value]), float(values[FIELD.CSUNAZI.value]), float(values[FIELD.CLOCTIME.value]),
    float(values[FIELD.QCA.value]), int(values[FIELD.QGE.value]), int(values[FIELD.QMI.value])]


'DATE': "01-Jan-2010"
'UTC': "01:20:00.022"
'JDATE': 2455197.555555820
'ORBIT': 2368
'SUNDIST': 0.98570
'SUNLAT': -0.31138
'SUNLON': 354.67220
'SCLK': 0284001599.65208
'SCLAT': 8.97501
'SCLON': 263.43874
'SCRAD': 1790.82210
'SCALT': 53.60619
'EL_CMD': 180.000
'AZ_CMD': 240.000
'AF': 110
'ORIENTLAT': -0.91547
'ORIENTATION': 173.69001
'C': 7
'DET': 1
'VLOOKX': 0.078868
'VLOOKY': 0.978111
'VLOOKZ': -0.192560
'RADIANCE': 0.4484
'TB': 95.856
'CLAT': 8.90937
'CLON': 263.37909
'CEMIS': 2.95260
'CSUNZEN': 91.32641
'CSUNAZI': 48.33062
'CLOCTIME': 5.91361
'QCA': 000
'QGE': 012
'QMI': 000


In [106]:
import sys

size = 0
for field, val in zip(FIELD_LIST, values):
    # sys.getsizeof reports the full Python object size, including per-object overhead
    tmp_size = sys.getsizeof(val)
    print(f"{field.name!r}: {tmp_size}")
    size += tmp_size

print(f"Total size: {size}")


'DATE': 62
'UTC': 63
'JDATE': 66
'ORBIT': 53
'SUNDIST': 56
'SUNLAT': 57
'SUNLON': 58
'SCLK': 65
'SCLAT': 56
'SCLON': 58
'SCRAD': 59
'SCALT': 57
'EL_CMD': 56
'AZ_CMD': 56
'AF': 52
'ORIENTLAT': 57
'ORIENTATION': 58
'C': 50
'DET': 50
'VLOOKX': 57
'VLOOKY': 57
'VLOOKZ': 58
'RADIANCE': 55
'TB': 55
'CLAT': 56
'CLON': 58
'CEMIS': 56
'CSUNZEN': 57
'CSUNAZI': 57
'CLOCTIME': 56
'QCA': 52
'QGE': 52
'QMI': 52
Total size: 1867


In [108]:
size = 0

# Skip DATE and UTC (still strings) and measure only the converted values
new_job_values = job_values[2:]

for field, val in zip(FIELD_LIST[2:], new_job_values):
    tmp_size = sys.getsizeof(val)
    print(f"{field.name!r}: {tmp_size}")
    size += tmp_size

print(f"Total size: {size}")


'JDATE': 24
'ORBIT': 24
'SUNDIST': 24
'SUNLAT': 24
'SUNLON': 24
'SCLK': 24
'SCLAT': 24
'SCLON': 24
'SCRAD': 24
'SCALT': 24
'EL_CMD': 24
'AZ_CMD': 24
'AF': 24
'ORIENTLAT': 24
'ORIENTATION': 24
'C': 24
'DET': 28
'VLOOKX': 24
'VLOOKY': 24
'VLOOKZ': 24
'RADIANCE': 24
'TB': 24
'CLAT': 24
'CLON': 24
'CEMIS': 24
'CSUNZEN': 24
'CSUNAZI': 24
'CLOCTIME': 24
'QCA': 24
'QGE': 28
'QMI': 24
Total size: 752


In [56]:
# new_date and new_utc are defined in a cell that is not shown here
print("New date: " + repr(sys.getsizeof(new_date)))
print("New utc: " + repr(sys.getsizeof(new_utc)))

New date: 28
New utc: 28
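
Taking the sizes printed above at face value, a rough per-file estimate can be worked out as follows. This is only a sketch: it assumes `new_date` and `new_utc` replace the `DATE` and `UTC` strings, it uses CPython object sizes from `sys.getsizeof` rather than on-disk database sizes, and it ignores the overhead of the containing lists.

```python
# Per-row totals copied from the cell outputs above (bytes, via sys.getsizeof)
STR_ROW_BYTES = 1867             # all 33 fields kept as strings
TYPED_ROW_BYTES = 752 + 28 + 28  # 31 converted fields plus new_date and new_utc (assumption)
ROWS = 98_448                    # lines that passed the check in the sample .TAB file

str_total = STR_ROW_BYTES * ROWS
typed_total = TYPED_ROW_BYTES * ROWS
saving = 1 - TYPED_ROW_BYTES / STR_ROW_BYTES

print(f"strings: {str_total / 1e6:.1f} MB")  # strings: 183.8 MB
print(f"typed:   {typed_total / 1e6:.1f} MB")  # typed:   79.5 MB
print(f"saving:  {saving:.1%}")              # saving:  56.7%
```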


In [1]:
filepath = "/esthar/diviner_data/txt_files/useful_tabs.txt"

count = 0

# Sum the last whitespace-separated value on each line
with open(filepath, 'r') as file:
    for line in file:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        count += int(parts[-1])

print(count)

1257795


In [None]:
import datetime
