# LakeFS API Connectivity Test (Demonstration of Wrapper using Real Schema)

## Overview
**Purpose:** This notebook serves as a lightweight "Health Check" for the project infrastructure.

**Scope:**
1.  **Connection:** Verifies the connection to the LakeFS server (port 8000) using the Python client.
2.  **Authentication:** Tests the Admin credentials (Access Key/Secret Key).
3.  **IO Operations:** Performs a minimal upload/download cycle using the real data schema to confirm the storage layer is working.

**Usage:**
Run this notebook *before* the main experiment (`LakeFS_Fraud.example.ipynb`) to ensure the environment is stable.

In [1]:
# Run this cell once if later cells fail with missing-package errors.
!pip install lakefs-client imbalanced-learn
!pip install xgboost lightgbm tensorflow

Collecting xgboost
  Using cached xgboost-2.1.4-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Collecting lightgbm
  Using cached lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl.metadata (17 kB)
Collecting tensorflow
  Using cached tensorflow-2.13.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting nvidia-nccl-cu12 (from xgboost)
  Using cached nvidia_nccl_cu12-2.28.9-py3-none-manylinux_2_18_x86_64.whl.metadata (2.0 kB)
Collecting absl-py>=1.0.0 (from tensorflow)
  Using cached absl_py-2.3.1-py3-none-any.whl.metadata (3.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow)
  Using cached astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=23.1.21 (from tensorflow)
  Using cached flatbuffers-25.9.23-py2.py3-none-any.whl.metadata (875 bytes)
Collecting gast<=0.4.0,>=0.2.1 (from tensorflow)
  Using cached gast-0.4.0-py3-none-any.whl.metadata (1.1 kB)
Collecting google-pasta>=0.1.1 (from tensorflow)
  Using cached g

Collecting certifi>=2017.4.17 (from requests<3,>=2.21.0->tensorboard<2.14,>=2.13->tensorflow)
  Using cached certifi-2025.11.12-py3-none-any.whl.metadata (2.5 kB)
Collecting pyasn1<0.7.0,>=0.6.1 (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.14,>=2.13->tensorflow)
  Using cached pyasn1-0.6.1-py3-none-any.whl.metadata (8.4 kB)
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->google-auth-oauthlib<1.1,>=0.5->tensorboard<2.14,>=2.13->tensorflow)
  Using cached oauthlib-3.3.1-py3-none-any.whl.metadata (7.9 kB)
Using cached xgboost-2.1.4-py3-none-manylinux_2_28_x86_64.whl (223.6 MB)
Using cached lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)
Downloading tensorflow-2.13.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (479.6 MB)
Downloading absl_py-2.3.1-py3-none-any.whl (135 kB)
Downloading astunpar

In [2]:
from LakeFS_Fraud_utils import LakeFSDataHandler
import pandas as pd

## Configuration & Initialization

### **Aim**
To establish a secure session with the LakeFS server using the provided credentials.

### **Inference**
A failure at this step usually points to a Docker container networking problem or invalid credentials. Success confirms that the `LakeFSDataHandler` wrapper was instantiated correctly.
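To tell a networking problem apart from a credentials problem, it can help to check plain TCP reachability first. The following is a minimal sketch using only the standard library; `can_reach` is a hypothetical helper, not part of `LakeFSDataHandler` or the lakeFS client, and it says nothing about whether the credentials are valid.

```python
import socket
from urllib.parse import urlparse


def can_reach(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper for illustration; it only tests network reachability.
    """
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `can_reach('http://host.docker.internal:8000')` before constructing the handler gives a quick signal: if it returns `False`, the problem is likely networking rather than the access or secret key.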

In [3]:
LAKEFS_HOST = 'http://host.docker.internal:8000' 
REPO_NAME = 'creditcard-fraud'
ACCESS_KEY = 'YOUR_ACCESS_KEY' 
SECRET_KEY = 'YOUR_SECRET_KEY'
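As an alternative to editing placeholder strings in the notebook, the same settings could be read from environment variables so that real keys never end up in a saved cell. This is a sketch under assumptions: the variable names (`LAKEFS_HOST`, `LAKEFS_REPO`, `LAKEFS_ACCESS_KEY`, `LAKEFS_SECRET_KEY`) and the `load_lakefs_config` helper are hypothetical and not something the project defines.

```python
import os


def load_lakefs_config(env=os.environ):
    """Read connection settings from environment variables.

    The variable names are hypothetical; the defaults mirror the values
    used in this notebook.
    """
    config = {
        "host": env.get("LAKEFS_HOST", "http://host.docker.internal:8000"),
        "repo": env.get("LAKEFS_REPO", "creditcard-fraud"),
        "access_key": env.get("LAKEFS_ACCESS_KEY"),
        "secret_key": env.get("LAKEFS_SECRET_KEY"),
    }
    # Fail early with a clear message instead of a later authentication error.
    missing = [key for key in ("access_key", "secret_key") if not config[key]]
    if missing:
        raise RuntimeError(f"Missing lakeFS credentials: {', '.join(missing)}")
    return config
```

The returned values can then be passed to `LakeFSDataHandler` in the same order as the constants above.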

In [4]:
handler = LakeFSDataHandler(LAKEFS_HOST, ACCESS_KEY, SECRET_KEY, REPO_NAME)
print("Wrapper Initialized.")

Wrapper Initialized.


## Use REAL Data Schema (Sample)

In [5]:
real_sample = pd.read_csv('creditcard.csv').head(100)
print("Loaded 100 rows of real creditcard.csv")

Loaded 100 rows of real creditcard.csv


## Upload Sample

### **Aim**
To verify that the system can write data to the `main` branch. We use a sample of the **real dataset** (`creditcard.csv`) rather than dummy data.

### **Inference**
Using the real schema surfaces any data type conflicts (e.g., float precision issues) during serialization that dummy data could hide. A successful upload confirms that the write path is operational.
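One way to look for data type conflicts before touching the server is to round-trip a frame through CSV in memory and compare column dtypes. The sketch below assumes the wrapper serializes to CSV (suggested by the `.csv` path, but not confirmed by this notebook); `dtype_changes` is a hypothetical helper and the sample values are made up for illustration.

```python
import io

import pandas as pd


def dtype_changes(df: pd.DataFrame) -> dict:
    """Round-trip a frame through in-memory CSV and report columns whose dtype changed."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    restored = pd.read_csv(buffer)
    return {
        col: (str(df[col].dtype), str(restored[col].dtype))
        for col in df.columns
        if df[col].dtype != restored[col].dtype
    }


# Made-up numeric frame shaped like a few of the dataset's columns.
numeric = pd.DataFrame({"Time": [0.0, 1.0], "Amount": [1.5, 2.5], "Class": [0, 1]})
print(dtype_changes(numeric))

# A datetime column does not survive a plain CSV round trip as datetime.
with_dates = numeric.assign(ts=pd.to_datetime(["2024-01-01", "2024-01-02"]))
print(dtype_changes(with_dates))
```

An empty result for the real sample would suggest that the purely numeric schema serializes cleanly; any reported column is worth inspecting before the full run.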

In [6]:
handler.upload_df(real_sample, branch='main', path='api_test/real_sample_head.csv', message='API Test with Real Schema')

Uploading to branch 'main' at path 'api_test/real_sample_head.csv'...
Committing: API Test with Real Schema


## Download Verification

### **Aim**
To retrieve the data we just uploaded and verify it matches the original input.

### **Inference**
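A matching shape of `(100, 31)` (100 rows, 31 columns) shows that the file round-trips through storage with its structure intact (Data In = Data Out). A shape match alone does not prove that every value is identical, but it is a sufficient health check of the storage layer before proceeding to the full training pipeline.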
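A shape check confirms structure but not values. A stricter comparison can be sketched with `pandas.testing.assert_frame_equal`; here `frames_match` is a hypothetical helper, and an in-memory CSV round trip stands in for the real upload/download cycle. In the notebook, the two arguments would be `real_sample` and `df_download`.

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(original: pd.DataFrame, downloaded: pd.DataFrame) -> bool:
    """Return True if both frames have the same columns, dtypes, and near-equal values."""
    try:
        # check_exact=False tolerates tiny float differences from text serialization.
        assert_frame_equal(original, downloaded, check_exact=False)
        return True
    except AssertionError:
        return False


# Illustrative stand-in for the upload/download cycle, using made-up values.
sample = pd.DataFrame({"Time": [0.0, 1.0], "V1": [-1.25, 0.5], "Class": [0, 1]})
restored = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(frames_match(sample, restored))
print(frames_match(sample, restored.assign(Class=[1, 1])))
```

The tolerance-based comparison is a deliberate choice: floats written as text and parsed back may differ in the last few digits, which an exact comparison would flag as a mismatch.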

In [7]:
df_download = handler.load_df(branch='main', path='api_test/real_sample_head.csv')
print("Downloaded Data Shape:", df_download.shape)
print(df_download.head())

Downloading from 'main/api_test/real_sample_head.csv'...
Downloaded Data Shape: (100, 31)
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  