# Enable Git Proxy for private Git server connectivity in Repos

### Overview
This feature is generally available (GA) on AWS, Azure, and GCP.

**Note**: an *admin* must run this notebook to enable the feature.

"Run all" this notebook to set up a cluster that proxies requests to your private Git server. Running this notebook does the following things:

1. Creates a [single node cluster](https://docs.databricks.com/clusters/single-node.html) to run Git Proxy.
2. Enables a feature flag that controls whether Git requests in Repos are proxied via the cluster.

You may need to wait several minutes after running this notebook for the cluster to reach a "RUNNING" state. It can also take up to 5 minutes for the feature flag configuration to take effect.

In [0]:
%pip install databricks-sdk==0.9.0
dbutils.library.restartPython()

## Create the proxy cluster

#### Setup admin token and HTTP requests
Get the admin token from the current context and prepare HTTP requests for the Databricks APIs.

In [0]:
import requests
from databricks.sdk import WorkspaceClient
from databricks.sdk.core import ApiClient
from databricks.sdk.service import compute
from databricks.sdk.service import iam
from databricks.sdk.service.compute import AwsAttributes, AzureAttributes, GcpAttributes
w = WorkspaceClient()
api_client = ApiClient()

In [0]:
cluster_name = "Repos Git Proxy"

create_cluster_data = {
    "cluster_name": cluster_name,
    "spark_version": "12.2.x-scala2.12",
    "autotermination_minutes": 0,
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

#### Create the cluster
Call Databricks Cluster API to create the Git Proxy cluster

In [0]:
# Get the list of node types available in this workspace; the cloud (AWS, Azure, or GCP) is detected from the workspace config below
nodes = w.clusters.list_node_types()
node_type_ids = [t.node_type_id for t in nodes.node_types]
aws_node_type_id = "m5.large"
aws_nitro_node_type_id = "m5n.large"
azure_node_type_id = "Standard_DS3_v2"
gcp_node_type_id = "e2-standard-4"

if w.config.is_aws:
    # Fail only if neither the standard nor the Nitro-based node type is available.
    if aws_node_type_id not in node_type_ids and aws_nitro_node_type_id not in node_type_ids:
        raise ValueError(
            f"Node types {aws_node_type_id} and {aws_nitro_node_type_id} do not exist in your workspace. Make sure one of these node types is available in your workspace, or contact support."
        )
    if aws_node_type_id in node_type_ids:
        node_type_id = aws_node_type_id
    else:
        node_type_id = aws_nitro_node_type_id
    create_cluster_data["aws_attributes"] = AwsAttributes.from_dict({"ebs_volume_count": 1, "ebs_volume_size": 32, "first_on_demand": 1})
elif w.config.is_azure:
    if azure_node_type_id not in node_type_ids:
        raise ValueError(
            f"Node type {azure_node_type_id} does not exist in your workspace. Make sure the node type specified is available in your workspace, or contact support."
        )
    node_type_id = azure_node_type_id
    create_cluster_data["azure_attributes"] = AzureAttributes.from_dict({
                        "availability": "ON_DEMAND_AZURE"
                        })
    
elif w.config.is_gcp:
    if gcp_node_type_id not in node_type_ids:
        raise ValueError(
            f"Node type {gcp_node_type_id} does not exist in your workspace. Make sure the node type specified is available in your workspace, or contact support."
        )
    node_type_id = gcp_node_type_id
    create_cluster_data["gcp_attributes"] = GcpAttributes.from_dict({
            "use_preemptible_executors": False
            })
else:
    raise ValueError(
        "The Databricks Git Proxy server only supports AWS, Azure, and GCP, and this workspace is running on an unsupported cloud. Please contact support."
    )

create_cluster_data["node_type_id"] = node_type_id


# Note: clusters.list() returns information about all pinned clusters, active clusters, up to 200 of the most recently terminated all-purpose clusters in the past 30 days, and up to 30 of the most recently terminated job clusters in the past 30 days. See https://github.com/databricks/databricks-sdk-py/blob/349216706aeac81828a807f40a21a1b0c80ed717/docs/workspace/clusters.rst?plain=1#L592
all_clusters = w.clusters.list()
clusters_names = [c.cluster_name for c in all_clusters]
print(f"List of existing clusters: {clusters_names}")

In [None]:
if cluster_name in clusters_names:
    raise ValueError(
        f"A cluster called {cluster_name} already exists. Please delete this cluster and re-run this notebook."
    )
else:
    # Create a new cluster named cluster_name that will proxy requests to the private Git server
    print(f"Create cluster POST request data: {create_cluster_data}")
    clusters_create_response = w.clusters.create(**create_cluster_data).result()
    print(f"Create cluster response: {clusters_create_response}")
    cluster_id = clusters_create_response.cluster_id
    print(f"Created new cluster with id {cluster_id}")

## Flip the feature flag!
This flips the feature flag to route Git requests to the cluster. The change should take effect within an hour.

In [0]:
api_client.do("PATCH", "/api/2.0/workspace-conf", body={"enableGitProxy": "true"}, headers={"Content-Type": "application/json"})
api_client.do("PATCH", "/api/2.0/workspace-conf", body={"gitProxyClusterId": cluster_id}, headers={"Content-Type": "application/json"})

#### Check that flag has been set
If the command below returns `{"enableGitProxy":"true"}`, you should be all set. Also check that the `gitProxyClusterId` value in the response matches the ID of the cluster created above.

In [0]:
get_flag_response = api_client.do("GET", "/api/2.0/workspace-conf", {"keys": "enableGitProxy"})
get_cluster_id_response = api_client.do("GET", "/api/2.0/workspace-conf", {"keys": "gitProxyClusterId"})
print(f"Get enableGitProxy response: {get_flag_response}")
print(f"Get gitProxyClusterId response: {get_cluster_id_response}")
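The manual check described above can also be expressed as a small helper. This is a minimal sketch, not part of the original notebook: the function name `git_proxy_configured` is hypothetical, and it assumes each response is a plain dict keyed by the setting name with string values, as in the `{"enableGitProxy":"true"}` example.

```python
def git_proxy_configured(flag_response, cluster_id_response, expected_cluster_id):
    """Return True if the flag is enabled and points at the expected cluster.

    Assumes both responses are plain dicts keyed by the setting name.
    """
    enabled = flag_response.get("enableGitProxy") == "true"
    cluster_matches = cluster_id_response.get("gitProxyClusterId") == expected_cluster_id
    return enabled and cluster_matches


# Hypothetical responses, shaped like the output of the GET calls above.
example_flag = {"enableGitProxy": "true"}
example_cluster = {"gitProxyClusterId": "0123-456789-abcdefgh"}
print(git_proxy_configured(example_flag, example_cluster, "0123-456789-abcdefgh"))
```

In the notebook itself, you would pass `get_flag_response`, `get_cluster_id_response`, and `cluster_id` instead of the example values.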

#### Confirm the cluster is ready
You can check the cluster status under [Compute](#setting/clusters). The cluster name is "Repos Git Proxy".
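If you prefer to check readiness from code rather than the UI, a polling loop is one option. The sketch below is illustrative only: `wait_until_running` is a hypothetical helper, the state names and timeouts are assumptions, and the cluster state is supplied by a callable so the loop can be tried without a live workspace.

```python
import time


def wait_until_running(get_state, timeout_seconds=600, poll_seconds=15, sleep=time.sleep):
    """Poll get_state() until it returns "RUNNING" or the timeout elapses.

    get_state is any zero-argument callable that returns the cluster state
    as a string. Returns True on "RUNNING" and False on timeout.
    """
    waited = 0
    while True:
        state = get_state()
        if state == "RUNNING":
            return True
        if state in ("TERMINATED", "ERROR"):
            raise RuntimeError(f"Cluster entered state {state} before becoming ready")
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake sequence of states and a no-op sleep, so nothing actually waits.
fake_states = iter(["PENDING", "PENDING", "RUNNING"])
print(wait_until_running(lambda: next(fake_states), sleep=lambda seconds: None))
```

In this notebook, the callable could be something like `lambda: w.clusters.get(cluster_id=cluster_id).state.value`, assuming the SDK version you installed returns the state as an enum with a string `value`.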

#### More configuration (only needed if Git Proxy is not working with the default setup)
To configure Git Proxy to match your setup, edit the cluster and go to Advanced options -> Spark -> Environment variables. Supported environment variables are:
- GIT_PROXY_ENABLE_SSL_VERIFICATION: Set to false to disable SSL verification, for example if your private Git server uses a self-signed certificate. Values: true/false
- GIT_PROXY_CA_CERT_PATH: Alternatively, provide the path to a CA cert file to use for SSL verification.
- GIT_PROXY_HTTP_PROXY: Set this if your data plane requires an HTTP proxy for all HTTP traffic. Example: https://localhost:3128
- GIT_PROXY_CUSTOM_HTTP_PORT: Set this if your Git server uses a non-standard HTTPS port. Example: 8443

After updating the environment variables, please save and restart the cluster.
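As an illustration of the settings above, the following sketch builds the text you might paste into the Environment variables box. The values are hypothetical, `to_env_lines` is a helper invented for this example, and the one-`KEY=value`-per-line layout is an assumption about what the cluster UI expects.

```python
# Hypothetical values; replace them with ones that match your Git server.
git_proxy_env_vars = {
    "GIT_PROXY_ENABLE_SSL_VERIFICATION": "false",
    "GIT_PROXY_CUSTOM_HTTP_PORT": "8443",
}


def to_env_lines(env_vars):
    """Format a dict of environment variables as sorted KEY=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(env_vars.items()))


print(to_env_lines(git_proxy_env_vars))
```

Only set the variables you actually need; an unset variable leaves the default behavior in place.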

## Flip the feature flag back if you are just testing
If you are just testing the proxy, keep the proxy cluster running during your test. Once the test finishes, change the config back to restore the original behavior.

The step below undoes the config change that reroutes Git requests; uncomment the line before running it. The change should take effect within an hour.

In [0]:
# api_client.do("PATCH", "/api/2.0/workspace-conf", body={"enableGitProxy": "false"}, headers={"Content-Type": "application/json"})