# sys.path - Repos and files data collection notebook
This is a companion notebook for https://go/git-syspath-sop - **you should refer to this document first**. The goal is to collect information and diagnose issues with `sys.path` and Workspace file access inside and outside of Git folders (repos).

The notebook should be run in different "configurations" to detect different problems.

Depending on the nature of the issue you are trying to debug, you should try running it in different:
* locations:
    - outside of a Git folder (i.e. in a normal Workspace folder)
    - in a conventional Git folder (i.e. without Git CLI)
    - in a Git CLI-enabled Git folder (see
       [GCP doc](https://docs.databricks.com/gcp/en/repos/git-operations-with-repos#use-git-cli-commands-beta),
       [AWS doc](https://docs.databricks.com/aws/en/repos/git-operations-with-repos#use-git-cli-commands-beta),
       [Azure doc](https://learn.microsoft.com/en-us/azure/databricks/repos/git-operations-with-repos#use-git-cli))
* execution methods:
    - interactive - i.e. running from the editor
    - in a job where it is referenced as a Workspace path
        - NOTE: there is a difference between running inside and outside of a Git folder
    - in a "Git job" (see [AWS doc](https://docs.databricks.com/aws/en/jobs/configure-job#use-git-with-jobs)) where it is referenced as a file in a repo (conveniently this repo is public)
* using different types of compute (see [AWS doc](https://docs.databricks.com/aws/en/compute/)):
    - Serverless
    - Classic
* using different compute configurations:
    - different serverless image versions (environments, see [AWS doc](https://docs.databricks.com/aws/en/compute/serverless/dependencies))
    - different access modes (see [AWS doc](https://docs.databricks.com/aws/en/compute/use-compute#what-are-compute-access-modes))


**Once you have run the notebook, follow the instructions in its output.**

Please note that you can inspect the code for more information - hover over the cell and click the show code icon.

## Notebook CWD and sys.path configuration

The next cell checks the notebook location and prints the assessment of whether this notebook is in a Git folder (repo) or not.

> If you just cloned the repo with this notebook, the output should state that the notebook is in a Git folder. If you moved it out, expect different output.

**What should I expect?**

The printed `CWD` (current working directory) and the ordering of the `sys.path` array should match the [Python library precedence documentation (AWS)](https://docs.databricks.com/aws/en/libraries#python-library-precedence).

If in a Git folder (repo), the `sys.path` array should contain:
* the **repo root folder**
* and the **notebook parent directory**.

Both should be placed close to the beginning of the `sys.path` array.

If not in a Git folder (repo), the `sys.path` array should contain:
* the **notebook parent directory**.

It should be placed close to the end of the `sys.path` array.

The printed `CWD` should be equal to the **notebook parent directory**.
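
The expectations above can be sketched as a small, Spark-free helper. This is only an illustration of the heuristic, not an official API: the function name `classify_sys_path_entry` and the example paths are hypothetical, and the rule "an ancestor directory immediately after the notebook directory is probably the repo root" is an assumption taken from the check in the next cell.

```python
import posixpath  # Workspace paths are POSIX-style regardless of the local OS


def classify_sys_path_entry(sys_path, notebook_dir):
    """Guess whether notebook_dir looks like it sits inside a Git folder.

    Hypothetical helper that mirrors the heuristic described above.
    """
    if notebook_dir not in sys_path:
        return "not in sys.path"
    pos = sys_path.index(notebook_dir)
    if pos == len(sys_path) - 1:
        # Last entry: consistent with a plain Workspace folder.
        return "likely not in a Git folder"
    next_path = sys_path[pos + 1]
    try:
        # Assumption: in a Git folder the next entry is an ancestor (the repo root).
        is_ancestor = posixpath.commonpath([notebook_dir, next_path]) == next_path
    except ValueError:
        # commonpath raises when absolute and relative entries are mixed.
        is_ancestor = False
    return "likely in a Git folder" if is_ancestor else "likely not in a Git folder"


# Made-up sys.path values for illustration only.
example = ["/databricks/python_shell", "/Workspace/Repos/u/r/nb", "/Workspace/Repos/u/r"]
print(classify_sys_path_entry(example, "/Workspace/Repos/u/r/nb"))
```

The result is only a hint: any code that modifies `sys.path` before the check runs would invalidate it.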

**What to do if the output seems incorrect?**

* **First** - take a look at the following cells for known issues.
* **Otherwise** - note (copy) the notebook environment and file an ES ticket.

In [0]:
import sys
import os
import textwrap

# Determine if sys.path contains the repo root folder and notebook/file parent directory
repo_root = None
notebook_dir = os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get())
if not notebook_dir.startswith("/Workspace/"):
    notebook_dir = "/Workspace" + ("" if notebook_dir.startswith("/") else "/") + notebook_dir

print("Notebook/file parent directory:")
print("- value: ", notebook_dir)

notebook_path_same_as_cwd = notebook_dir == os.getcwd()
print("- same as CWD: ", notebook_path_same_as_cwd)

notebook_dir_in_path = notebook_dir in sys.path
print("- in sys.path: ", notebook_dir_in_path)

if (notebook_dir and notebook_dir_in_path):
  print("\nIn reference to https://docs.databricks.com/aws/en/libraries#python-library-precedence:")
  notebook_pos = sys.path.index(notebook_dir)
  next_path = sys.path[notebook_pos + 1] if len(sys.path) > notebook_pos + 1 else None
  # Since this notebook does not manipulate sys.path,
  # we can assume that if the path is at the end of sys.path, notebook is not in a repo.
  is_repo = notebook_pos != (len(sys.path) - 1) and os.path.commonpath([notebook_dir, next_path]) == next_path
  if is_repo:
      print(textwrap.dedent(f"""\
             - Notebook/file is likely in a repo:
               - sys.path position is low (sys.path[{notebook_pos}])
               - next path in sys.path is an ancestor directory (likely repo root): {next_path}
            """))
  else:
      print(textwrap.dedent(
          f"""\
          - Notebook/file parent is NOT likely in a repo:
            - sys.path position is high (sys.path[{notebook_pos}])
            - next path in sys.path is not an ancestor directory (likely not repo root): {next_path}
          **If the notebook IS IN A REPO - please file an ES.**
          """))

print("")
print("---------------------------------")
print("sys.path:\n", sys.path)
print("")
print("os.getcwd():\n", os.getcwd())



## Verify local file access

The next cell checks whether the notebook can access local files (on the compute cluster) from Spark worker nodes (via a UDF).

In [0]:
import pandas as pd
from pyspark.sql.functions import pandas_udf

local_file_content = "This is a new file with test content."

@pandas_udf('string')
def f0(x):
    file_path = "/tmp/local-file-access-test.txt"
    with open(file_path, "w") as f:
        f.write(local_file_content)

    file_content = 'not read'
    try:
        with open(file_path, 'r') as file:
            file_content = file.read()        
    except FileNotFoundError:
        file_content = f'{file_path} not found'
    except Exception as e:
        file_content = f'An error occurred: {e}'
    return pd.Series([file_content])

print("Verifying access to local files from UDF:")
result = spark.range(1).repartition(1).select(f0("id")).collect()
udf_access_result = result[0][0]
if udf_access_result == local_file_content:
  print("- Local files can be accessed from UDF (Workers)")
else:
  print(f"- Error accessing local files from UDF (Workers): {udf_access_result}")
  print(f"  This might mean that the cluster is misconfigured or there is a bug in runtime.")
  print(f"  Please verify and contact #help-files if the problem cannot be fixed.")
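
To separate file system problems from Spark problems, the same write-then-read round trip can be tried without a UDF. The following is a minimal sketch under that assumption; `check_write_then_read` is a hypothetical helper, and it runs against a temporary directory rather than `/tmp` directly.

```python
import os
import tempfile


def check_write_then_read(directory, content="This is a new file with test content."):
    """Write a file, read it back, and return (ok, detail).

    Hypothetical helper: 'detail' is the content read back, or an error message.
    """
    file_path = os.path.join(directory, "local-file-access-test.txt")
    try:
        with open(file_path, "w") as f:
            f.write(content)
        with open(file_path, "r") as f:
            read_back = f.read()
    except OSError as e:
        # Covers missing directories, permission errors and similar I/O failures.
        return False, f"An error occurred: {e}"
    return read_back == content, read_back


with tempfile.TemporaryDirectory() as tmp_dir:
    ok, detail = check_write_then_read(tmp_dir)
    print("round trip ok:", ok)
```

If this succeeds on the driver while the UDF check fails, the problem is more likely specific to the worker side.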


## Verify Workspace file access

The next cell checks whether the notebook can access Workspace files from Spark worker nodes (via UDF).

In [0]:
from datetime import datetime
from pyspark.sql.functions import pandas_udf, lit
import pandas as pd
import random
import string
import os

def print_files_info_on_failure():
    print(f"  Check if Workspace file system (files) is enabled for the Workspace.")
    print(f"  If files are enabled, please contact #help-files.")


content = "This is a new file with test content."

wsfs_file_setup_success = False
file_path = None
udf_access_result = None

def create_test_file():
    print("Setting up the test file in user's home:")
    user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
    base_path = f"/Workspace/Users/{user}"

    name_suffix = datetime.now().strftime('%Y-%m-%dT%H_%M_%S')
    file_path = f"{base_path}/file-access-test-file-{name_suffix}.txt"
    try:
        while os.path.exists(file_path):
            name_suffix = ''.join(random.choices(string.ascii_letters + string.digits, k=8))
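            # Keep the directory separator and file name prefix when retrying with a random suffix
            file_path = f"{base_path}/file-access-test-file-{name_suffix}.txt"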
    except Exception as e:
        print(f"- Cannot access Workspace files to prepare test file: {e}")
        print_files_info_on_failure()
        return None, False

    try:
        with open(file_path, 'w') as file:
            file.write(content)
        print(f"- Successfully set up the test file at {file_path}")
        print(f"  Driver can write Workspace files")
        return file_path, True
    except Exception as e:
        print(f"- Failed to set up the test file: {e}")
        print(f"  Driver cannot write Workspace files")
        print_files_info_on_failure()
        return None, False

@pandas_udf('string')
def f1(file_path_series: pd.Series) -> pd.Series:
    def read_file(file_path_from_series):
        try:
            with open(file_path_from_series, 'r') as file:
                return file.read()
        except FileNotFoundError:
            return f'{file_path_from_series} not found'
        except Exception as e:
            return f'An error occurred: {e}'
    return file_path_series.apply(read_file)



file_path, wsfs_file_setup_success = create_test_file()

print("\nVerifying access to Workspace files from UDF:")
if wsfs_file_setup_success and file_path:
    result = spark.range(1).repartition(1).select(f1(lit(file_path))).collect()
    udf_access_result = result[0][0]
    if udf_access_result == content:
        print("- Workspace files can be accessed from UDF (Workers). Workspace file system seems to work")
    else:
        print(f"- Error accessing Workspace files from UDF (Workers): {udf_access_result}")
        print_files_info_on_failure()
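
The test file name in the cell above is meant to be unique: a timestamp suffix first, then a random suffix if that name is already taken. A standalone sketch of that naming scheme follows; `unique_test_file_path` is a hypothetical helper, and a temporary directory stands in for the user's Workspace home.

```python
import os
import random
import string
import tempfile
from datetime import datetime


def unique_test_file_path(base_path, prefix="file-access-test-file-"):
    """Return a path under base_path that does not exist yet (hypothetical helper)."""
    suffix = datetime.now().strftime("%Y-%m-%dT%H_%M_%S")
    candidate = os.path.join(base_path, f"{prefix}{suffix}.txt")
    while os.path.exists(candidate):
        # Name collision: fall back to a random 8-character suffix and retry.
        suffix = "".join(random.choices(string.ascii_letters + string.digits, k=8))
        candidate = os.path.join(base_path, f"{prefix}{suffix}.txt")
    return candidate


with tempfile.TemporaryDirectory() as tmp_dir:
    print(os.path.basename(unique_test_file_path(tmp_dir)))
```

Building every candidate from the same directory and prefix keeps retried names inside the intended folder.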

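The next cell prints the notebook environment (runtime, cluster, cloud provider, Python version, user, compute type and access mode). Copy this output when filing an ES ticket.

In [0]: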
import sys

# Display notebook environment information
print("=== Notebook Environment ===")

# Spark version is always available
print(f"Spark Version: {spark.version}")

# Try to get Databricks Runtime version from different sources
try:
    dbr_version = spark.conf.get('spark.databricks.clusterUsageTags.sparkVersion')
    print(f"Databricks Runtime Version: {dbr_version}")
except Exception:
    try:
        # Alternative way to get DBR version
        dbr_version = spark.conf.get('spark.databricks.clusterUsageTags.clusterVersion')
        print(f"Databricks Runtime Version: {dbr_version}")
    except Exception:
        print("Databricks Runtime Version: Not available (Serverless)")

# Try to get cluster information
try:
    cluster_id = spark.conf.get('spark.databricks.clusterUsageTags.clusterId')
    print(f"Cluster ID: {cluster_id}")
except Exception:
    print("Cluster ID: Not available (Serverless)")

try:
    cluster_name = spark.conf.get('spark.databricks.clusterUsageTags.clusterName')
    print(f"Cluster Name: {cluster_name}")
except Exception:
    print("Cluster Name: Not available (Serverless)")

# Cloud provider
try:
    cloud_provider = spark.conf.get('spark.databricks.cloudProvider')
    print(f"Cloud Provider: {cloud_provider}")
except Exception:
    print("Cloud Provider: Not available")

# Python version
print(f"Python Version: {sys.version.split()[0]}")

# Get user context
user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
print(f"Current User: {user}")

# Compute type
try:
    compute_type = spark.conf.get('spark.databricks.clusterSource')
    print(f"Compute Type: {compute_type}")
except Exception:
    print("Compute Type: Serverless (inferred)")

# Access mode
try:
    access_mode = spark.conf.get('spark.databricks.clusterUsageTags.dataSecurityMode')
    print(f"Access Mode: {access_mode}")
except Exception:
    print("Access Mode: Not available")

The next cell copies the current notebook so it can be conveniently run outside a Git folder.

In [0]:
import shutil
import os

print("=== Run this notebook outside of a repo ===")

# Suffix to prevent infinite loop
COPIED_SUFFIX = "_copied"

# Get the current notebook path
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()

# Prevent infinite loop: if this is the copied notebook, skip copying/running
if notebook_path.endswith(COPIED_SUFFIX):
    print("Detected copied notebook execution. Skipping copy.")
else:
    notebook_file_path = notebook_path
    if not notebook_file_path.startswith("/Workspace/"):
        notebook_file_path = "/Workspace" + ("" if notebook_file_path.startswith("/") else "/") + notebook_file_path
    if not notebook_file_path.endswith('.ipynb'):
        notebook_file_path += '.ipynb'

    # Get the user's home directory
    user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
    home_dir = f"/Workspace/Users/{user}"

    # Add suffix to copied notebook name
    base_name = os.path.basename(notebook_path)
    copied_name = base_name + COPIED_SUFFIX
    dest_path = os.path.join(home_dir, copied_name)
    dest_file_path = dest_path + '.ipynb'

    print(f"Copying notebook:")
    print(f"  from: {notebook_file_path}")
    print(f"    to: {dest_path}")
    try:
        shutil.copy(notebook_file_path, dest_file_path)
        print(f"Navigate to the other notebook to run it and check results: ")
        workspace_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
        workspace_id = dbutils.notebook.entry_point.getDbutils().notebook().getContext().workspaceId().get()
        notebook_link = f"{workspace_url}/?o={workspace_id}#workspace/{dest_path}"
        print(f"  {notebook_link}")
    except Exception as e:
        print(f"- Failed to copy notebook: {e}")
        raise
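
The path handling in the cell above can be tried in isolation. The sketch below reproduces only the string manipulation, with no `dbutils` or file copy. The helper names and the example user are hypothetical, and the `.ipynb` extension is the same assumption the cell makes about how the notebook is stored.

```python
import posixpath  # Workspace paths are POSIX-style


def to_workspace_file_path(notebook_path, extension=".ipynb"):
    """Turn a notebook path into a /Workspace file path (hypothetical helper)."""
    path = notebook_path
    if not path.startswith("/Workspace/"):
        path = "/Workspace" + ("" if path.startswith("/") else "/") + path
    if not path.endswith(extension):
        # Assumption: the notebook is stored in .ipynb format.
        path += extension
    return path


def copied_destination(notebook_path, home_dir, suffix="_copied", extension=".ipynb"):
    """Build the destination file path for the copy in the user's home (hypothetical helper)."""
    base_name = posixpath.basename(notebook_path)
    return posixpath.join(home_dir, base_name + suffix) + extension


# Made-up values for illustration only.
print(to_workspace_file_path("/Repos/user@example.com/repo/notebook"))
print(copied_destination("/Repos/user@example.com/repo/notebook", "/Workspace/Users/user@example.com"))
```

A notebook whose name already ends with the suffix is treated as the copy, which is how the cell avoids copying itself again.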