# Data Preprocessing

This notebook provides a procedure for pre-processing Diviner data. The authors use the Diviner Channel 7 dataset from January 2010 to March 2019. However, since more recent data is available through 2023, this additional data will be used as well. Note that LRO changed from a near-circular to an elliptical orbit in December 2011, which causes the effective FOV of point measurements to vary with latitude.

In total, the uncompressed data will take up approximately 35TB, while the processed data will be in the neighbourhood of 3TB. To minimize the amount of storage needed, the data will be downloaded and processed in batches. The processed data will be kept, while the unprocessed data will be deleted before the next batch is processed.
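
The batch workflow described above could be organized roughly as follows. This is only a sketch: `batched` and `process_in_batches` are hypothetical helpers, and the `download`, `process` and `cleanup` callables stand in for steps that this notebook has not yet implemented.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_in_batches(urls, batch_size, download, process, cleanup):
    """Download, process, then clean up one batch at a time.

    `download(url)` is assumed to return the local path of the raw file,
    `process(path)` to store the processed data, and `cleanup(path)` to
    delete the raw file. Returns the number of batches handled.
    """
    count = 0
    for batch in batched(urls, batch_size):
        raw_paths = [download(url) for url in batch]
        for path in raw_paths:
            process(path)
        # Raw data is removed before the next batch is downloaded
        for path in raw_paths:
            cleanup(path)
        count += 1
    return count
```

Because only one batch of raw files exists on disk at any time, the peak storage is bounded by the batch size plus the processed data accumulated so far.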

We are interested in the .TAB data files which are stored within .zip files.

The steps involved:
1. Download Diviner data by month/year
2. Filter out the data points with the following parameters:
    * instrument activity == 110 ("on moon" orientation, standard nadir observation, nominal instrument mode)
    * calibration == 0 (in-bounds interpolated measurement)
    * geometry flags == 12 (tracking data used to generate geometry information)
    * misc flag == 0 (no misc observations)
    * emission angle < 10 deg (angle between the vector from surface FOV center to DIVINER and the normal vector to the moon's surface)
    * only keep data from channel 7
3. Sort data into 0.5 deg x 0.5 deg latitude and longitude bins
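
Steps 2 and 3 can be sketched as a filter predicate and a bin-index function. The record keys (`activity`, `calibration`, `geometry`, `misc`, `emission_angle`, `channel`) are hypothetical names chosen for illustration, not the actual .TAB column names, and the binning assumes latitude in [-90, 90] with longitude wrapped into [0, 360).

```python
import math

BIN_SIZE_DEG = 0.5
N_LAT_BINS = int(180 / BIN_SIZE_DEG)  # 360 latitude rows
N_LON_BINS = int(360 / BIN_SIZE_DEG)  # 720 longitude columns


def passes_filter(record):
    """Return True if a record (dict with hypothetical keys) meets all criteria."""
    return (
        record["activity"] == 110
        and record["calibration"] == 0
        and record["geometry"] == 12
        and record["misc"] == 0
        and record["emission_angle"] < 10.0
        and record["channel"] == 7
    )


def bin_index(lat, lon):
    """Map a latitude/longitude pair in degrees to a (row, col) bin index."""
    lon = lon % 360.0  # wrap negative or >= 360 longitudes
    row = int(math.floor((lat + 90.0) / BIN_SIZE_DEG))
    col = int(math.floor(lon / BIN_SIZE_DEG))
    # Clamp so that lat == 90 (and rounding at the wrap) lands in the last bin
    return min(row, N_LAT_BINS - 1), min(col, N_LON_BINS - 1)
```

The clamp is a design choice: without it, a point exactly at the north pole would fall into a 361st row.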

## Import Required Libraries

In [28]:
from bs4 import BeautifulSoup
import concurrent.futures
from datetime import datetime
import os
import requests
import re
import subprocess
import sqlite3
from urllib.parse import urljoin
from zipfile import ZipFile

## Setup Local Directories

In [31]:
# The local directory where the data will be stored
# Note: Esthar has 10TB
DATA_DIR = '/esthar/diviner_data'

# Create the directory if it doesn't exist
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

## Setup Local Database

In [29]:
# TBD
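
Until the database cell is filled in, one possible starting point is a single SQLite table indexed by bin. This is a sketch only: the table name `measurements`, its columns, and the index name are assumptions made for illustration, not a schema the notebook defines.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical table for filtered points plus a bin index."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS measurements (
            utc     TEXT,     -- observation time (assumed column)
            lat     REAL,     -- latitude in degrees
            lon     REAL,     -- longitude in degrees
            tb      REAL,     -- measured value of interest (assumed column)
            lat_bin INTEGER,  -- 0.5 deg latitude bin index
            lon_bin INTEGER   -- 0.5 deg longitude bin index
        )
        """
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_measurements_bin "
        "ON measurements (lat_bin, lon_bin)"
    )
    conn.commit()
```

For example, `create_schema(sqlite3.connect(os.path.join(DATA_DIR, 'diviner.db')))` would place the database next to the downloaded data; the file name `diviner.db` is likewise hypothetical.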

## Common Functions

In [23]:
'''
@brief Returns a list of sub-links on a parent page.

@param parent_url The url page that is being searched.
@param pattern A regex pattern if required to filter the url list.

@return A list of sub-links on the page.
'''
def get_sub_urls(parent_url, pattern=None):

    # Send a GET request to get page elements; fail loudly on HTTP errors
    response = requests.get(parent_url, timeout=60)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract sub-urls
    sub_urls = [urljoin(parent_url, link.get("href")) for link in soup.find_all("a", href=True)]

    # Filter the list using regex if a pattern is specified
    if pattern:
        regex = re.compile(pattern)
        sub_urls = [url for url in sub_urls if regex.match(url)]

    return sub_urls
    

'''
@brief Crawls through urls on a page using multithreading

@param input_urls The parent urls to search
@param pattern Optional regex pattern to match url against 
'''
def multithread_crawl(input_urls, pattern=None):
    
    # Without a pattern, keep only links located below the parent url.
    # The url is escaped so '.' and similar characters match literally.
    def crawl(target):
        return get_sub_urls(target, pattern or re.escape(target))

    # Use multi-threading
    with concurrent.futures.ThreadPoolExecutor() as executor:
        target_urls_list = list(executor.map(crawl, input_urls))

    # Collapse into single list
    target_urls = [url for sublist in target_urls_list for url in sublist] 

    return target_urls
    
    
'''
@brief Walks through the RDR V1 parent links to find all
       zip file urls
'''
def find_all_zip_urls(target_year=None):

    # lrodlr_1001 contains data from 2009 - 2016
    # lrodlr_1002 contains data from 2017 - 2023
    parent_urls = [
        'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1001/data/',
        'https://pds-geosciences.wustl.edu/lro/lro-l-dlre-4-rdr-v1/lrodlr_1002/data/']

    # Regex pattern will filter URLs for years
    # If a year is specified, the search is only for that year
    # Otherwise it is for all years 2010-2023
    if target_year:
        pattern = r'.*/{0}/'.format(target_year)
    else:
        pattern = r'.*/20[1-2]\d/$'
        
    # Generate list of year urls
    year_urls = get_sub_urls(parent_urls[0], pattern) + get_sub_urls(parent_urls[1], pattern)
    
    # Search for month urls
    month_urls = multithread_crawl(year_urls)

    # Search for day urls
    day_urls = multithread_crawl(month_urls)

    # Search for zip urls
    zip_urls = multithread_crawl(day_urls, r'.+\.zip$')

    return zip_urls
        

'''
@brief Given a link to a .zip file, this function will
       download, unpack the .zip file, then delete
       the original .zip file to minimize storage used
       
@param dest_dir The local directory to save to
@param src_url The url to the target .zip file
'''
def download_unpack_delete(dest_dir, src_url):

    # Verify the destination directory exists
    os.makedirs(dest_dir, exist_ok=True)

    # Extract filename
    filename = os.path.join(dest_dir, src_url.split("/")[-1])

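    # Download the zip file, streaming it to disk in chunks
    with requests.get(src_url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                file.write(chunk)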

    # Extract the contents of the zip file
    with ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall(dest_dir)

    # Delete original .zip file
    os.remove(filename)


## 2010

First, find all the .zip file urls for the year 2010.

In [9]:
zip_urls = find_all_zip_urls('2010')

print(f"Found {len(zip_urls)} .zip file urls")

Found 51483 .zip file urls


Download the zip files, unpack the .TAB file, and then delete the original zip file. Note that each .TAB file seems to be about 289MB. Downloading all 51,483 .TAB files from 2010 alone would require just over 14TB of storage.
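
The storage estimate can be reproduced with a few lines of arithmetic. This sketch assumes the quoted per-file size is in binary units (289 MiB) and that every file is about the same size; the 2 TiB budget in the last line is a hypothetical figure used only to show how a batch size might be derived.

```python
N_FILES = 51483       # .zip urls found for 2010
FILE_SIZE_MIB = 289   # approximate size of one .TAB file (assumed MiB)


def total_size_tib(n_files, file_size_mib):
    """Total size in TiB of n_files files of file_size_mib MiB each."""
    return n_files * file_size_mib / 1024**2


def files_per_batch(budget_tib, file_size_mib):
    """Largest number of files whose combined size fits within budget_tib."""
    return (budget_tib * 1024**2) // file_size_mib


print(f"2010 total: {total_size_tib(N_FILES, FILE_SIZE_MIB):.2f} TiB")
print(f"Files per 2 TiB batch: {files_per_batch(2, FILE_SIZE_MIB)}")
```

If the size is instead in decimal megabytes, the same file count comes to roughly 14.9TB, so the estimate is not very sensitive to the unit convention.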

In [27]:
download_unpack_delete(DATA_DIR, zip_urls[1])

Insert the .TAB data into the local database, then delete the original .TAB file.

In [None]:
# TBD
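
A possible shape for this step is sketched below. It assumes the .TAB file is comma-separated text in which header or comment lines start with `#`; that layout, the table name `raw_points`, and the `columns` mapping from field names to positions are all assumptions to check against the actual files.

```python
import csv
import sqlite3


def iter_data_rows(lines):
    """Yield the stripped fields of each data line, skipping blank and '#' lines."""
    data_lines = (
        line for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )
    for fields in csv.reader(data_lines):
        yield [field.strip() for field in fields]


def load_tab(conn, lines, columns):
    """Insert (lat, lon, tb) rows into a hypothetical raw_points table.

    `columns` maps the names 'lat', 'lon' and 'tb' to field positions,
    which must be supplied by the caller. Returns the number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_points (lat REAL, lon REAL, tb REAL)"
    )
    rows = (
        (float(f[columns["lat"]]), float(f[columns["lon"]]), float(f[columns["tb"]]))
        for f in iter_data_rows(lines)
    )
    with conn:  # commit once for the whole file
        cursor = conn.executemany("INSERT INTO raw_points VALUES (?, ?, ?)", rows)
    return cursor.rowcount
```

Passing an open file object as `lines` streams the file row by row, so a large .TAB file never has to be held in memory; `os.remove` can then delete it once `load_tab` returns.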

Delete data points that don't meet the parameter criteria.

In [30]:
# TBD
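
If the flag values are stored alongside each point, the deletion can be a single SQL statement. The sketch below assumes a hypothetical table `tab_rows` with columns named after the filter criteria; neither the table nor the column names come from the Diviner files.

```python
import sqlite3

# Hypothetical table and column names mirroring the filter criteria
DELETE_REJECTED_SQL = """
DELETE FROM tab_rows
WHERE NOT (
    activity = 110
    AND calibration = 0
    AND geometry = 12
    AND misc = 0
    AND emission_angle < 10
    AND channel = 7
)
"""


def delete_rejected(conn):
    """Delete rows failing any criterion; return the number of rows removed."""
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(DELETE_REJECTED_SQL)
    return cursor.rowcount
```

Note that a row with a NULL in one of these columns makes the condition evaluate to NULL rather than true, so such rows would be left in place and need separate handling.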