# Enable Git Proxy for private Git server connectivity in Repos

### Overview
This private preview feature is available on AWS, Azure and GCP.

**Note**: an *admin* must run this notebook to enable the feature.

"Run all" this notebook to set up a cluster that proxies requests to your private Git server. Running this notebook does the following things:

0. Creates a [single node cluster](https://docs.databricks.com/clusters/single-node.html) that runs the Git Proxy. **Important**: all users in the workspace are granted "attach to" permissions on the cluster.
0. Enables a feature flag that controls whether Git requests in Repos are proxied via the cluster.

You may need to wait several minutes after running this notebook for the cluster to reach a "RUNNING" state. It can also take up to 5 minutes for the feature flag configuration to take effect.

## Create the proxy cluster

#### Setup admin token and HTTP requests
Get the admin token from the current notebook context, then prepare the HTTP headers and endpoint paths for the Databricks APIs.

In [0]:
import requests

admin_token = (
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
)
databricks_instance = (
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
)

headers = {"Authorization": f"Bearer {admin_token}"}

# Clusters
CLUSTERS_LIST_ENDPOINT = "/api/2.0/clusters/list"
CLUSTERS_CREATE_ENDPOINT = "/api/2.0/clusters/create"
CLUSTERS_LIST_NODE_TYPES_ENDPOINT = "/api/2.0/clusters/list-node-types"
CLUSTERS_GET_ENDPOINT = "/api/2.0/clusters/get"

# Permissions
UPDATE_PERMISSIONS_ENDPOINT = "/api/2.0/permissions/clusters"

# Workspace Conf
WORKSPACE_CONF_ENDPOINT = "/api/2.0/workspace-conf"
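
The calls in this notebook pass each response straight to `.json()` without checking the HTTP status, so a failed request can surface later as a confusing `KeyError`. A minimal sketch of a wrapper that fails fast is shown below. The names `api_url` and `call_api` are hypothetical helpers that are not part of the original notebook, and `session` is assumed to be a `requests.Session` (or any object with a compatible `request` method).

```python
def api_url(instance: str, endpoint: str) -> str:
    """Join the workspace URL and an API path without doubling the slash."""
    return instance.rstrip("/") + "/" + endpoint.lstrip("/")


def call_api(session, method: str, instance: str, endpoint: str, token: str, **kwargs):
    """Send one authenticated request and return the decoded JSON body.

    Raises on a non-2xx status instead of returning an error payload.
    `session` is assumed to behave like requests.Session (hypothetical helper).
    """
    response = session.request(
        method,
        api_url(instance, endpoint),
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
        **kwargs,
    )
    response.raise_for_status()
    return response.json()
```

With `session = requests.Session()`, the node type lookup could then be written as `call_api(session, "GET", databricks_instance, CLUSTERS_LIST_NODE_TYPES_ENDPOINT, admin_token)["node_types"]`.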

#### Create the cluster
Call the Databricks Clusters API to create the Git Proxy cluster.

In [0]:
cluster_name = "Repos Git Proxy"
create_cluster_data = {
    "cluster_name": cluster_name,
    "spark_version": "12.2.x-scala2.12",
    "num_workers": 0,
    "autotermination_minutes": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
# Get the list of node types to determine whether this workspace is on AWS, Azure, or GCP
clusters_node_types = requests.get(
    databricks_instance + CLUSTERS_LIST_NODE_TYPES_ENDPOINT, headers=headers
).json()["node_types"]
# Avoid shadowing the built-in `type`
node_type_ids = [node_type["node_type_id"] for node_type in clusters_node_types]
aws_node_type_id = "m5.large"
azure_node_type_id = "Standard_DS3_v2"
gcp_node_type_id = "e2-standard-4"
if aws_node_type_id in node_type_ids:
    create_cluster_data = {
        **create_cluster_data,
        "node_type_id": aws_node_type_id,
        "aws_attributes": {"ebs_volume_count": "1", "ebs_volume_size": "32", "first_on_demand": "1"},
    }
elif azure_node_type_id in node_type_ids:
    create_cluster_data = {**create_cluster_data, "node_type_id": azure_node_type_id}
elif gcp_node_type_id in node_type_ids:
    create_cluster_data = {**create_cluster_data, "node_type_id": gcp_node_type_id}
else:
    raise ValueError(
        f"Node types {aws_node_type_id}, {azure_node_type_id}, and {gcp_node_type_id} do not exist. Make sure you are on AWS, Azure, or GCP, or contact support."
    )

# Note: this only returns up to 100 terminated all-purpose clusters in the past 30 days
clusters_list_response = requests.get(
    databricks_instance + CLUSTERS_LIST_ENDPOINT, headers=headers
).json()
# The "clusters" key may be absent when the workspace has no clusters
clusters_list = clusters_list_response.get("clusters", [])
clusters_names = [cluster["cluster_name"] for cluster in clusters_list]
print(f"List of existing clusters: {clusters_names}")

if cluster_name in clusters_names:
    raise ValueError(
        f"Cluster called {cluster_name} already exists. Please delete this cluster and re-run this notebook"
    )
else:
    # Create a new cluster named cluster_name that will proxy requests to the private Git server
    print(f"Create cluster POST request data: {create_cluster_data}")
    clusters_create_response = requests.post(
        databricks_instance + CLUSTERS_CREATE_ENDPOINT,
        headers=headers,
        json=create_cluster_data,
    ).json()
    print(f"Create cluster response: {clusters_create_response}")
    cluster_id = clusters_create_response["cluster_id"]
    print(f"Created new cluster with id {cluster_id}")
    update_permissions_data = {
        "access_control_list": [
            {"group_name": "users", "permission_level": "CAN_ATTACH_TO"}
        ]
    }
    update_permissions_response = requests.patch(
        databricks_instance + UPDATE_PERMISSIONS_ENDPOINT + f"/{cluster_id}",
        headers=headers,
        json=update_permissions_data,
    ).json()
    print(f"Update permissions response: {update_permissions_response}")
    print(f"Gave all users ATTACH TO permissions to cluster {cluster_id}")
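
The cloud detection in the cell above relies on one marker node type per cloud. As a sketch, the same decision can be expressed as a small lookup function, which keeps the three clouds and the error message in sync. `DEFAULT_NODE_TYPES` and `pick_node_type` are hypothetical names introduced here for illustration; the node type IDs are the ones used in the cell above.

```python
# Marker node types taken from the cell above, one per supported cloud.
DEFAULT_NODE_TYPES = {
    "aws": "m5.large",
    "azure": "Standard_DS3_v2",
    "gcp": "e2-standard-4",
}


def pick_node_type(available_node_type_ids):
    """Return (cloud, node_type_id) for the first marker type the workspace offers."""
    available = set(available_node_type_ids)
    for cloud, node_type_id in DEFAULT_NODE_TYPES.items():
        if node_type_id in available:
            return cloud, node_type_id
    raise ValueError(
        f"None of the node types {sorted(DEFAULT_NODE_TYPES.values())} are available."
    )


# Example with a made-up list of node type IDs, as an Azure workspace might report
cloud, node_type_id = pick_node_type(["Standard_DS3_v2", "Standard_DS4_v2"])
print(cloud, node_type_id)  # azure Standard_DS3_v2
```

The AWS branch would still need to add its `aws_attributes` block after the lookup, since that part is specific to one cloud.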

#### Wait for the cluster to be ready

Before sending traffic to the cluster, wait for it to be up and ready to serve requests.

In [0]:
import time

sleep_time_s = 10

state = None
while True:
    clusters_get_response = requests.get(
        url=databricks_instance + CLUSTERS_GET_ENDPOINT,
        headers=headers,
        params={"cluster_id": cluster_id},
    ).json()
    state = clusters_get_response.get("state", None)
    if state == "RUNNING":
        print("Cluster is ready!")
        break
    else:
        # Sleep once per iteration so the actual wait matches the printed interval
        print(f"Cluster is in state {state}, waiting for {sleep_time_s} seconds and trying again")
        time.sleep(sleep_time_s)
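
The loop above polls until the cluster reports `RUNNING`, which means it never exits if the cluster fails to start. A possible variant with a timeout and an early exit on failure states is sketched below. `wait_until_running` is a hypothetical helper, the 20-minute default is an arbitrary assumption, and `get_state` stands for any callable that returns the current cluster state string.

```python
import time


def wait_until_running(get_state, timeout_s=1200, poll_interval_s=10, sleep=time.sleep):
    """Poll get_state() until it returns "RUNNING".

    Raises RuntimeError if the cluster stops or errors, and TimeoutError
    once roughly timeout_s seconds of polling have elapsed.
    """
    failed_states = {"TERMINATING", "TERMINATED", "ERROR"}
    waited_s = 0
    while True:
        state = get_state()
        if state == "RUNNING":
            return state
        if state in failed_states:
            raise RuntimeError(f"Cluster entered state {state} before becoming ready")
        if waited_s >= timeout_s:
            raise TimeoutError(f"Cluster still in state {state} after {waited_s} seconds")
        sleep(poll_interval_s)
        waited_s += poll_interval_s
```

In this notebook, `get_state` could be a lambda that calls `CLUSTERS_GET_ENDPOINT` with `cluster_id` and returns `.json().get("state")`. Passing `sleep` as a parameter makes the function easy to exercise without real delays.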

## Flip the feature flag!
This flips the feature flag so that Git requests are routed to the cluster. The change should take effect within an hour.

In [0]:
patch_enable_git_proxy_data = {"enableGitProxy": "true"}
patch_git_proxy_cluster_id_data = {"gitProxyClusterId": cluster_id}
# raise_for_status() surfaces a rejected request instead of failing silently
requests.patch(
    databricks_instance + WORKSPACE_CONF_ENDPOINT,
    headers=headers,
    json=patch_enable_git_proxy_data,
).raise_for_status()
requests.patch(
    databricks_instance + WORKSPACE_CONF_ENDPOINT,
    headers=headers,
    json=patch_git_proxy_cluster_id_data,
).raise_for_status()

#### Check that flag has been set
If the command below returns `{"enableGitProxy":"true"}`, the flag is set. Also check that the `gitProxyClusterId` value in the response matches the ID of the cluster created above.

In [0]:
get_flag_response = requests.get(
    databricks_instance + WORKSPACE_CONF_ENDPOINT + "?keys=enableGitProxy",
    headers=headers,
).json()
get_cluster_id_response = requests.get(
    databricks_instance + WORKSPACE_CONF_ENDPOINT + "?keys=gitProxyClusterId",
    headers=headers,
).json()
print(f"Get enableGitProxy response: {get_flag_response}")
print(f"Get gitProxyClusterId response: {get_cluster_id_response}")
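
The manual check described above can also be expressed as a small assertion helper. This is only a sketch: `git_proxy_is_configured` is a hypothetical function, and it assumes the `gitProxyClusterId` response has the same `{key: value}` shape as the `enableGitProxy` response.

```python
def git_proxy_is_configured(flag_response, cluster_id_response, expected_cluster_id):
    """Return True when the flag is "true" and the stored cluster ID matches."""
    flag_ok = flag_response.get("enableGitProxy") == "true"
    cluster_ok = cluster_id_response.get("gitProxyClusterId") == expected_cluster_id
    return flag_ok and cluster_ok
```

In this notebook it could be called as `git_proxy_is_configured(get_flag_response, get_cluster_id_response, cluster_id)`. A `False` result shortly after setting the flag is not necessarily an error, since the configuration can take time to propagate.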

## Wait for the cluster, and more configuration

#### Wait for the cluster to be ready
Wait for the cluster to be ready. You can check its status under [Compute](#setting/clusters); the cluster name is "Repos Git Proxy".

#### More configuration (only needed if Git Proxy does not work with the default setup)
To configure Git Proxy to match your setup, edit the cluster and go to Advanced options -> Spark -> Environment variables. Supported environment variables are:
- GIT_PROXY_ENABLE_SSL_VERIFICATION: Set to `true` or `false`. You may want to disable SSL verification if you are using a self-signed certificate for your private Git server.
- GIT_PROXY_CA_CERT_PATH: Alternatively, provide the path to a CA cert file to use for SSL verification.

**GIT_PROXY_CUSTOM_HTTP_PORT and GIT_PROXY_HTTP_PROXY are not available at this time.**

After updating the environment variables, save and restart the cluster.
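
As an alternative to editing the cluster in the UI, the same variables can in principle be supplied through the `spark_env_vars` field of the cluster specification sent to the Clusters API. The sketch below shows one way to build that field. `with_git_proxy_env` is a hypothetical helper, and the certificate path in the example is a placeholder, not a location this notebook creates.

```python
def with_git_proxy_env(cluster_spec, enable_ssl_verification=True, ca_cert_path=None):
    """Return a copy of cluster_spec with Git Proxy environment variables set."""
    env = dict(cluster_spec.get("spark_env_vars", {}))
    env["GIT_PROXY_ENABLE_SSL_VERIFICATION"] = "true" if enable_ssl_verification else "false"
    if ca_cert_path is not None:
        env["GIT_PROXY_CA_CERT_PATH"] = ca_cert_path
    return {**cluster_spec, "spark_env_vars": env}


# Keep verification on and point at a custom CA certificate (placeholder path)
spec = with_git_proxy_env(
    {"cluster_name": "Repos Git Proxy"},
    enable_ssl_verification=True,
    ca_cert_path="/path/to/ca-cert.pem",
)
print(spec["spark_env_vars"])
```

The function returns a new dictionary rather than mutating its input, so the original specification stays available for comparison.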