# Introduction

## Architecture

I decided to solve this challenge on Azure: Azure Blob Storage holds the JSON file, Azure Key Vault stores all secrets, and Azure Databricks serves as my data engineering platform.
Since LATAM Airlines operates on GCP, in that environment I would replace Blob Storage with Google Cloud Storage and Key Vault with Google Secret Manager. Databricks is a multi-cloud platform that runs on Azure, AWS, and GCP, so that service could stay unchanged.
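
To make the migration concrete, here is a minimal sketch of how the storage location would differ between the two clouds. The helper `build_storage_uri`, the account name, and the bucket name are hypothetical and exist only for illustration. The `wasbs://` form mirrors the mount source used later in this notebook, and `gs://` is the usual Google Cloud Storage URI scheme.

```python
def build_storage_uri(cloud: str, container: str, account: str = "") -> str:
    """Return the root URI of a storage container (hypothetical helper)."""
    if cloud == "azure":
        if not account:
            raise ValueError("Azure Blob Storage needs a storage account name")
        return f"wasbs://{container}@{account}.blob.core.windows.net/"
    if cloud == "gcp":
        # On GCS the bucket name alone identifies the location
        return f"gs://{container}/"
    raise ValueError(f"unsupported cloud: {cloud!r}")


# Placeholder names, not the real account or bucket
azure_uri = build_storage_uri("azure", "data", "examplestorageaccount")
gcp_uri = build_storage_uri("gcp", "example-bucket")
print(azure_uri)
print(gcp_uri)
```

Only the URI and the secret lookup would change; the rest of the notebook works on a file path and would not need to know which cloud it runs on.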

## Setup

First, I define some variables that I will use later.

In [0]:
storage_account_name = dbutils.secrets.get(scope="kv-scope", key="storage-account-name")
container_name = "data"
mount_point = "/mnt/data"
file_path = "farmers-protest-tweets-2021-2-4.json"
dbfs_file_path = f"/dbfs{mount_point}/{file_path}"

Then I mount the Azure Blob Storage container into the Databricks environment so that I can access the JSON file directly from Databricks.

In [0]:
# Create this mount, if it does not already exist
if not any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/",
        mount_point = mount_point,
        extra_configs  = {f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net" : dbutils.secrets.get(scope="kv-scope", key="access-key")}
    )

Next, I install all required libraries on the cluster and suppress the installation output.

In [0]:
%pip install -q memory_profiler py-spy emoji polars

In [0]:
from memory_profiler import profile
import importlib

# Challenge

In this challenge, I chose Polars over PySpark and Pandas because it performed better on this data set. Pandas is a popular choice for data manipulation, but it is typically slower and more memory-intensive on large JSON files, which made it the least suitable option here. PySpark is powerful for distributed data processing, but its initialization and processing overhead made it slower than Polars in this case. For the memory-optimized functions, I read the JSON file row by row instead of loading the entire file into memory, which keeps memory usage low. Polars’ lightweight, efficient DataFrame operations, combined with this row-by-row approach, enabled me to achieve both speed and memory efficiency.
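
The row-by-row idea can be sketched with the standard library alone. This is not the code in `src`; it is a minimal illustration that assumes the file is newline-delimited JSON and that each record has a `date` field holding an ISO 8601 timestamp. The function name `count_by_date` and the sample records are made up for the example.

```python
import json
import os
import tempfile
from collections import Counter


def count_by_date(path: str) -> Counter:
    """Stream a newline-delimited JSON file and count records per calendar date."""
    counts = Counter()
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:  # only one line is held in memory at a time
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: "date" is an ISO 8601 timestamp; keep the YYYY-MM-DD part
            counts[record["date"][:10]] += 1
    return counts


# Tiny made-up sample, not taken from the real data set
sample = [
    {"date": "2021-02-12T10:00:00+00:00", "user": {"username": "a"}},
    {"date": "2021-02-12T11:30:00+00:00", "user": {"username": "b"}},
    {"date": "2021-02-13T09:15:00+00:00", "user": {"username": "a"}},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    for row in sample:
        tmp.write(json.dumps(row) + "\n")

date_counts = count_by_date(tmp.name)
os.remove(tmp.name)
print(date_counts.most_common(1))
```

Because only the running counts are kept, memory grows with the number of distinct dates rather than with the size of the file.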

## q1_memory

In [0]:
from src.q1_memory import q1_memory

In [0]:
result = q1_memory(dbfs_file_path)
result

## q1_time

In [0]:
from src.q1_time import q1_time

In [0]:
result = q1_time(dbfs_file_path)
result

## q2_memory

In [0]:
from src.q2_memory import q2_memory

In [0]:
result = q2_memory(dbfs_file_path)
result

## q2_time

In [0]:
from src.q2_time import q2_time

In [0]:
result = q2_time(dbfs_file_path)
result

## q3_memory

In [0]:
from src.q3_memory import q3_memory

In [0]:
result = q3_memory(dbfs_file_path)
result

## q3_time

In [0]:
from src.q3_time import q3_time

In [0]:
result = q3_time(dbfs_file_path)
result

# Metrics Monitoring

In [0]:
import time
from memory_profiler import profile, memory_usage
import polars as pl


# List of functions to profile
functions_to_profile = [
    q1_memory, q1_time, 
    q2_memory, q2_time, 
    q3_memory, q3_time
]

# Helper function to measure memory usage only
def measure_memory(func, *args, **kwargs):
    # Sample the process memory every 0.1 s while func runs; wrapping func in
    # @profile is not needed here and would only add line-by-line overhead
    mem_usage = memory_usage((func, args, kwargs), interval=0.1, retval=False)
    avg_memory = sum(mem_usage) / len(mem_usage)
    return avg_memory

# Helper function to measure execution time only
def measure_time(func, *args, **kwargs):
    # perf_counter is a monotonic, high-resolution clock intended for timing code
    start_time = time.perf_counter()
    func(*args, **kwargs)  # Run function without profiling
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    return execution_time

# Initialize list to store results
metrics_list = []

# Run profiling for each function and store the results in the list
for func in functions_to_profile:
    avg_memory = measure_memory(func, dbfs_file_path)  # Measure memory usage
    execution_time = measure_time(func, dbfs_file_path)  # Measure execution time without profiling
    metrics_list.append({
        "Function Name": func.__name__,
        "Average Memory Usage (MiB)": round(avg_memory, 2),
        "Execution Time (seconds)": round(execution_time, 2)
    })

In [0]:
# Convert the list of metrics to a Polars DataFrame
metrics_df = pl.DataFrame(metrics_list)

print(metrics_df)
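
The table above reports the average of the sampled memory values. Peak usage is often the more relevant number when deciding whether a job fits on a node, so here is a hedged sketch of a peak measurement using the standard-library `tracemalloc` module instead of `memory_profiler`. The helper `measure_peak_python_memory` and the stand-in workload `build_list` are hypothetical. Note that `tracemalloc` only tracks allocations made through Python's allocator, so memory allocated natively by Polars would most likely not be counted.

```python
import tracemalloc


def measure_peak_python_memory(func, *args, **kwargs):
    """Run func and return (result, peak traced Python memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory returns (current, peak) in bytes
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


def build_list(n):
    # Hypothetical stand-in workload; replace with one of the q* functions
    return [i * 2 for i in range(n)]


result, peak_mib = measure_peak_python_memory(build_list, 100_000)
print(f"peak: {peak_mib:.2f} MiB")
```

For the functions in this notebook, the call would look like `measure_peak_python_memory(q1_memory, dbfs_file_path)`.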