# 01 - Getting Started: Environment Setup & Validation

This notebook will guide you through setting up and validating your Google Cloud MLOps environment.

## Overview

This is **Phase 1** of our MLOps pipeline project. We'll:

1. **Validate Python Environment** - Check required packages
2. **Configure Google Cloud Authentication** - Set up credentials
3. **Verify API Access** - Ensure required services are enabled
4. **Set Up Cloud Storage** - Create and configure the GCS buckets
5. **Test Vertex AI Connectivity** - Verify ML platform access
6. **Prepare Sample Dataset** - Load the Iris dataset from scikit-learn and save it
7. **Environment Summary** - Validate complete setup

---

## 🏗️ Section 1: Python Environment Validation

First, let's verify our Python environment and confirm that the required packages are installed.

In [1]:
import sys
import subprocess
from pathlib import Path

print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")
print(f"Current working directory: {Path.cwd()}")

Python version: 3.13.7 (main, Aug 14 2025, 11:12:11) [Clang 17.0.0 (clang-1700.0.13.3)]
Python executable: /Users/farishussain/GCP_MLOps/venv/bin/python
Current working directory: /Users/farishussain/GCP_MLOps/notebooks


In [2]:
# Google Cloud Platform Setup Check
print("☁️ Checking Google Cloud Platform Setup...")
print("=" * 60)

import subprocess
import json

def check_gcp_setup():
    """Check if Google Cloud Platform is properly configured"""
    
    setup_status = {
        'gcloud_installed': False,
        'authenticated': False,
        'project_set': False,
        'project_id': None,
        'apis_enabled': [],
        'billing_enabled': False,
        'overall_status': 'NOT_READY'
    }
    
    # Check gcloud installation
    try:
        result = subprocess.run(['gcloud', '--version'], 
                              capture_output=True, text=True)
        if result.returncode == 0:
            setup_status['gcloud_installed'] = True
            print("✅ Google Cloud SDK (gcloud) is installed")
        else:
            print("❌ Google Cloud SDK (gcloud) not found")
            return setup_status
    except FileNotFoundError:
        print("❌ Google Cloud SDK (gcloud) not installed")
        print("   Install from: https://cloud.google.com/sdk/docs/install")
        return setup_status
    
    # Check authentication
    try:
        result = subprocess.run(['gcloud', 'auth', 'list', '--format=json'], 
                              capture_output=True, text=True)
        if result.returncode == 0:
            auth_accounts = json.loads(result.stdout)
            if auth_accounts:
                setup_status['authenticated'] = True
                print("✅ Authenticated with Google Cloud")
                for account in auth_accounts:
                    status = "🟢 ACTIVE" if account.get('status') == 'ACTIVE' else "⚪ INACTIVE"
                    print(f"   {status} {account['account']}")
            else:
                print("❌ Not authenticated with Google Cloud")
                print("   Run: gcloud auth login")
    except Exception as e:
        print(f"⚠️ Could not check authentication: {e}")
    
    # Check project configuration
    try:
        result = subprocess.run(['gcloud', 'config', 'get-value', 'project'], 
                              capture_output=True, text=True)
        if result.returncode == 0 and result.stdout.strip():
            project_id = result.stdout.strip()
            setup_status['project_set'] = True
            setup_status['project_id'] = project_id
            print(f"✅ Project configured: {project_id}")
        else:
            print("❌ No default project set")
            print("   Run: gcloud config set project YOUR_PROJECT_ID")
    except Exception as e:
        print(f"⚠️ Could not check project: {e}")
    
    # Check essential APIs (if project is set)
    if setup_status['project_set']:
        essential_apis = [
            'aiplatform.googleapis.com',
            'storage.googleapis.com', 
            'bigquery.googleapis.com',
            'compute.googleapis.com'
        ]
        
        print("\n🔧 Checking essential APIs for MLOps...")
        for api in essential_apis:
            try:
                result = subprocess.run([
                    'gcloud', 'services', 'list', 
                    '--enabled', 
                    f'--filter=name:{api}',
                    '--format=value(name)'
                ], capture_output=True, text=True)
                
                # Short display name, e.g. "Aiplatform" for aiplatform.googleapis.com
                api_name = api.split('.')[0].title()
                if api in result.stdout:
                    setup_status['apis_enabled'].append(api)
                    print(f"   ✅ {api_name} API enabled")
                else:
                    print(f"   ❌ {api_name} API not enabled")
                    print(f"      Run: gcloud services enable {api}")
            except Exception as e:
                print(f"   ⚠️ Could not check {api}: {e}")
    
    # Determine overall status
    if (setup_status['gcloud_installed'] and 
        setup_status['authenticated'] and 
        setup_status['project_set'] and 
        len(setup_status['apis_enabled']) >= 2):
        setup_status['overall_status'] = 'READY'
    elif setup_status['gcloud_installed'] and setup_status['authenticated']:
        setup_status['overall_status'] = 'PARTIAL'
    else:
        setup_status['overall_status'] = 'NOT_READY'
    
    return setup_status

# Run the GCP setup check
gcp_status = check_gcp_setup()

print("\n📊 Google Cloud Platform Status Summary:")
print(f"   Overall Status: {gcp_status['overall_status']}")

if gcp_status['overall_status'] == 'READY':
    print("   🎉 Your Google Cloud Platform is ready for MLOps!")
    print(f"   📋 Project: {gcp_status['project_id']}")
    print(f"   🔧 APIs Enabled: {len(gcp_status['apis_enabled'])}")
elif gcp_status['overall_status'] == 'PARTIAL':
    print("   ⚠️  Partial setup - some configuration needed")
    print("   📝 Follow the error messages above to complete setup")
else:
    print("   ❌ Setup required before proceeding")
    print("   📚 See: https://cloud.google.com/docs/get-started")

print("\n🎯 This project is designed to run entirely on Google Cloud Platform:")
print("   • 📊 Data stored in Google Cloud Storage")
print("   • 🤖 Models trained on Vertex AI")
print("   • 🚀 Models deployed to Vertex AI endpoints")
print("   • 🔄 Pipelines orchestrated with Vertex AI Pipelines")
print("   • 📈 Monitoring via Google Cloud Console")

☁️ Checking Google Cloud Platform Setup...
✅ Google Cloud SDK (gcloud) is installed
✅ Authenticated with Google Cloud
   ⚪ INACTIVE faris.hussain@enmacc.com
   🟢 ACTIVE farishussain049@gmail.com
✅ Project configured: mlops-295610

🔧 Checking essential APIs for MLOps...
   ✅ Aiplatform API enabled
   ✅ Storage API enabled
   ✅ Bigquery API enabled
   ✅ Compute API enabled

📊 Google Cloud Platform Status Summary:
   Overall Status: READY
   🎉 Your Google Cloud Platform is ready for MLOps!
   📋 Project: mlops-295610
   🔧 APIs Enabled: 4

🎯 This project is designed to run entirely on Google Cloud Platform:
   • 📊 Data stored in Google Cloud Storage
   • 🤖 Models trained on Vertex AI
   • 🚀 Models deployed to Vertex AI endpoints
   • 🔄 Pipelines orchestrated with Vertex AI Pipelines
   • 📈 Monitoring via Google Cloud Console


In [3]:
# Verify key packages
import importlib

required_packages = [
    'google.cloud.aiplatform',
    'google.cloud.storage', 
    'pandas',
    'numpy',
    'sklearn',
    'kfp'
]

print("📦 Package Validation:")
for package in required_packages:
    try:
        mod = importlib.import_module(package)
        version = getattr(mod, '__version__', 'Unknown')
        print(f"  ✅ {package}: {version}")
    except ImportError:
        print(f"  ❌ {package}: Not installed")

📦 Package Validation:
  ✅ google.cloud.aiplatform: 1.128.0
  ✅ google.cloud.storage: 3.6.0
  ✅ pandas: 2.3.3
  ✅ numpy: 2.3.5
  ✅ sklearn: 1.7.2
  ✅ kfp: 2.14.6
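
`__version__` is a convention rather than a guarantee, so a module may report `Unknown` in the check above. As a hedged alternative, the standard library's `importlib.metadata` reads the version from the installed distribution. It expects the distribution name (for example `scikit-learn`), not the import name (`sklearn`). The helper name `installed_version` is illustrative.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI, which can differ from import names.
for dist in ["google-cloud-aiplatform", "google-cloud-storage", "pandas",
             "numpy", "scikit-learn", "kfp"]:
    version = installed_version(dist)
    print(f"{dist}: {version if version else 'Not installed'}")
```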


## 🔐 Section 2: Google Cloud Authentication

Set up authentication for Google Cloud services.

In [4]:
import os
from google.cloud import aiplatform
from google.auth import default

# Initialize Vertex AI with your project settings
PROJECT_ID = "mlops-295610"  # Replace with your project ID
REGION = "us-central1"

# Set environment variable for project
os.environ['GOOGLE_CLOUD_PROJECT'] = PROJECT_ID

print("🔧 Project Configuration:")
print(f"  Project ID: {PROJECT_ID}")
print(f"  Region: {REGION}")

🔧 Project Configuration:
  Project ID: mlops-295610
  Region: us-central1


In [5]:
# Test authentication
try:
    credentials, project = default()
    print("✅ Authentication successful!")
    print(f"  Authenticated project: {project}")
    
    # Initialize Vertex AI
    aiplatform.init(project=PROJECT_ID, location=REGION)
    print("✅ Vertex AI initialized successfully!")
    
except Exception as e:
    print(f"❌ Authentication failed: {e}")
    print("\n🔧 Quick fix: Run the following command in your terminal:")
    print("  gcloud auth application-default login")

✅ Authentication successful!
  Authenticated project: mlops-295610
✅ Vertex AI initialized successfully!
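
Hard-coding `PROJECT_ID` means every reader has to edit the cell. One possible alternative is to resolve the project from several sources in a fixed order. The sketch below assumes this precedence: an explicit value, then the `GOOGLE_CLOUD_PROJECT` environment variable, then the project returned by `google.auth.default()`. The function `resolve_project_id` is hypothetical, and the ADC project is passed in as a plain argument so the logic can be tried without credentials.

```python
import os


def resolve_project_id(explicit=None, adc_project=None, environ=None):
    """Pick a project ID: explicit value, then GOOGLE_CLOUD_PROJECT, then the ADC project."""
    environ = os.environ if environ is None else environ
    for candidate in (explicit, environ.get("GOOGLE_CLOUD_PROJECT"), adc_project):
        if candidate:
            return candidate
    raise ValueError(
        "No project ID found. Run: gcloud config set project YOUR_PROJECT_ID"
    )


# In the notebook, adc_project would be the second value returned by google.auth.default().
print(resolve_project_id(adc_project="example-project", environ={}))
```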


## 🌐 Section 3: Google Cloud APIs Verification

Verify that required Google Cloud APIs are enabled.

In [6]:
from google.cloud import storage
from google.api_core import exceptions

# Test Cloud Storage API
try:
    storage_client = storage.Client(project=PROJECT_ID)
    # List all buckets in the project to test API access
    buckets = list(storage_client.list_buckets())
    print(f"✅ Cloud Storage API: Working ({len(buckets)} buckets found)")
except exceptions.Forbidden:
    print("❌ Cloud Storage API: Access denied - check IAM permissions")
except Exception as e:
    print(f"❌ Cloud Storage API: {e}")

✅ Cloud Storage API: Working (4 buckets found)


In [7]:
# Test Vertex AI API
try:
    # Try to list models to test Vertex AI API access
    models = aiplatform.Model.list()
    print(f"✅ Vertex AI API: Working ({len(models)} models found)")
except exceptions.Forbidden:
    print("❌ Vertex AI API: Access denied - check IAM permissions")
except Exception as e:
    # An empty project returns an empty list, so any exception here is a real failure
    print(f"❌ Vertex AI API: {e}")

✅ Vertex AI API: Working (0 models found)
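
Both API checks follow the same shape: call a cheap list operation and report the outcome. A small helper can make that pattern explicit and keep any unexpected exception visible as a failure. This is a sketch only. `probe` is not part of the notebook, `PermissionError` stands in for `google.api_core.exceptions.Forbidden`, and the callables in the usage stand in for calls such as `storage_client.list_buckets()`.

```python
def probe(name, call):
    """Run a zero-argument callable and return (ok, message) describing the result."""
    try:
        result = call()
    except PermissionError as exc:  # stand-in for google.api_core.exceptions.Forbidden
        return False, f"{name}: Access denied - check IAM permissions ({exc})"
    except Exception as exc:
        # Any other error is reported as a failure, never as success.
        return False, f"{name}: {type(exc).__name__}: {exc}"
    return True, f"{name}: Working ({len(list(result))} items found)"


def failing_call():
    """Simulate an API call that fails for a reason other than permissions."""
    raise RuntimeError("service unavailable")


print(probe("Cloud Storage API", lambda: ["bucket-a", "bucket-b"]))
print(probe("Vertex AI API", failing_call))
```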


## 🪣 Section 4: Cloud Storage Setup

Create and configure the GCS bucket for our MLOps pipeline.

In [8]:
# Set up Google Cloud Storage for dataset management
print("☁️ Setting up Google Cloud Storage for dataset management...")

from google.cloud import storage
import google.auth

# Initialize Google Cloud Storage
try:
    # Get credentials and project
    credentials, project = google.auth.default()
    PROJECT_ID = gcp_status['project_id'] if gcp_status['project_id'] else "mlops-295610"
    REGION = "us-central1"
    
    # Create storage client
    storage_client = storage.Client(project=PROJECT_ID)
    
    # Define bucket names for different purposes
    buckets_config = {
        'data_processing': f"{PROJECT_ID}-mlops-data-processing",
        'models': f"{PROJECT_ID}-mlops-models", 
        'vertex_ai_staging': f"{PROJECT_ID}-vertex-ai-staging",
        'pipeline_artifacts': f"{PROJECT_ID}-pipeline-artifacts"
    }
    
    print("📋 Google Cloud Storage Configuration:")
    print(f"   Project: {PROJECT_ID}")
    print(f"   Region: {REGION}")
    
    # Create/verify buckets exist
    created_buckets = []
    existing_buckets = []
    
    for purpose, bucket_name in buckets_config.items():
        try:
            bucket = storage_client.bucket(bucket_name)
            
            if not bucket.exists():
                # Create bucket
                bucket = storage_client.create_bucket(bucket_name, location=REGION)
                created_buckets.append((purpose, bucket_name))
                print(f"   ✅ Created {purpose} bucket: {bucket_name}")
            else:
                existing_buckets.append((purpose, bucket_name))
                print(f"   ✅ Found existing {purpose} bucket: {bucket_name}")
                
        except Exception as e:
            print(f"   ⚠️ Could not setup {purpose} bucket: {e}")
    
    if created_buckets:
        print(f"\n🆕 Created {len(created_buckets)} new buckets")
    if existing_buckets:
        print(f"🔄 Using {len(existing_buckets)} existing buckets")
    
    print("\n🗂️  Bucket Organization:")
    print("   📊 Data Processing: Raw data, processed datasets")
    print("   🤖 Models: Trained models, model artifacts")
    print("   🚀 Vertex AI Staging: ML training jobs, model serving")
    print("   🔄 Pipeline Artifacts: Kubeflow pipeline runs, metrics")
    
    # Ready only if every bucket was created or already existed
    gcs_ready = len(created_buckets) + len(existing_buckets) == len(buckets_config)
    BUCKET_NAME = buckets_config['data_processing']  # Primary bucket for data
    
except Exception as e:
    print(f"❌ Google Cloud Storage setup error: {e}")
    print("   Will use local file storage as fallback")
    gcs_ready = False
    BUCKET_NAME = None
    PROJECT_ID = "mlops-295610"
    REGION = "us-central1"

# Store configuration for other notebooks
config_info = {
    'project_id': PROJECT_ID,
    'region': REGION,
    'gcs_ready': gcs_ready,
    'buckets': buckets_config if gcs_ready else {},
    'primary_bucket': BUCKET_NAME
}

# Report success only when the buckets are actually available
if gcs_ready:
    print("\n✅ Google Cloud Storage setup complete!")
    print("📊 Configuration stored for use across all notebooks")
    print("🎯 All data will be stored in Google Cloud Storage")
    print(f"🌐 Access via: https://console.cloud.google.com/storage/browser?project={PROJECT_ID}")
else:
    print("\n⚠️  Fallback: Using local storage (not recommended for production)")

☁️ Setting up Google Cloud Storage for dataset management...
📋 Google Cloud Storage Configuration:
   Project: mlops-295610
   Region: us-central1
   ✅ Created data_processing bucket: mlops-295610-mlops-data-processing
   ✅ Created models bucket: mlops-295610-mlops-models
   ✅ Found existing vertex_ai_staging bucket: mlops-295610-vertex-ai-staging
   ✅ Created pipeline_artifacts bucket: mlops-295610-pipeline-artifacts

🆕 Created 3 new buckets
🔄 Using 1 existing buckets

🗂️  Bucket Organization:
   📊 Data Processing: Raw data, processed datasets
   🤖 Models: Trained models, model artifacts
   🚀 Vertex AI Staging: ML training jobs, model serving
   🔄 Pipeline Artifacts: Kubeflow pipeline runs, metrics

✅ Google Cloud Storage setup complete!
📊 Configuration stored for use across all notebooks
🎯 All data will be stored in Google Cloud Storage
🌐 Access via: https://console.cloud.google.com/storage/browser?project=mlops-295610
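
The `config_info` dictionary lives only in this kernel's memory, so another notebook cannot read it unless it is written somewhere. One way to share it, sketched here under the assumption that a JSON file is acceptable, is shown below. The file name `gcp_config.json` and both helper names are hypothetical; the sample values mirror the output above.

```python
import json
import tempfile
from pathlib import Path


def save_config(config, path):
    """Write the configuration dictionary to a JSON file and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


def load_config(path):
    """Read a configuration dictionary previously written by save_config."""
    return json.loads(Path(path).read_text())


config_info = {
    "project_id": "mlops-295610",
    "region": "us-central1",
    "gcs_ready": True,
    "buckets": {"data_processing": "mlops-295610-mlops-data-processing"},
    "primary_bucket": "mlops-295610-mlops-data-processing",
}

# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_config(config_info, Path(tmp) / "gcp_config.json")
    print(load_config(saved) == config_info)
```

In a real project the file would live at a stable path that later notebooks agree on, rather than in a temporary directory.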


In [12]:
# Upload dataset to Google Cloud Storage
# Requires csv_path, npz_path, and metadata_path from Section 6 (cells 10 and 11)
print("📤 Uploading Iris dataset to Google Cloud Storage...")

if gcs_ready and BUCKET_NAME:
    try:
        bucket = storage_client.bucket(BUCKET_NAME)
        
        # Upload CSV data
        print("   📊 Uploading CSV data...")
        csv_blob = bucket.blob("raw-data/iris_dataset.csv")
        csv_blob.upload_from_filename(str(csv_path))
        
        # Upload NPZ data
        print("   📊 Uploading NPZ data...")
        npz_blob = bucket.blob("raw-data/iris_dataset.npz")
        npz_blob.upload_from_filename(str(npz_path))
        
        # Upload metadata
        print("   📋 Uploading metadata...")
        metadata_blob = bucket.blob("raw-data/iris_metadata.pkl")
        metadata_blob.upload_from_filename(str(metadata_path))
        
        # Verify uploads
        print("\n✅ Dataset uploaded to Google Cloud Storage:")
        print(f"   📊 CSV: gs://{BUCKET_NAME}/raw-data/iris_dataset.csv")
        print(f"   📦 NPZ: gs://{BUCKET_NAME}/raw-data/iris_dataset.npz")
        print(f"   📋 Metadata: gs://{BUCKET_NAME}/raw-data/iris_metadata.pkl")
        
        # List all objects in bucket to confirm
        blobs = list(bucket.list_blobs(prefix="raw-data/"))
        print("\n🗂️  Files in Google Cloud Storage:")
        for blob in blobs:
            size_mb = blob.size / (1024 * 1024)
            print(f"   📄 {blob.name} ({size_mb:.2f} MB)")
        
        print("\n🌐 Access your data:")
        print(f"   Console: https://console.cloud.google.com/storage/browser/{BUCKET_NAME}/raw-data")
        print(f"   CLI: gsutil ls gs://{BUCKET_NAME}/raw-data/")
        
        # Test reading from GCS
        print("\n🔄 Testing data access from GCS...")
        
        # Download the CSV into memory and parse it (no temporary file to clean up)
        import io
        import pandas as pd
        
        test_df = pd.read_csv(io.BytesIO(csv_blob.download_as_bytes()))
        
        print(f"   ✅ Successfully read CSV from GCS: {test_df.shape}")
        print(f"   📊 Columns: {list(test_df.columns)}")
        
        cloud_storage_ready = True
        
    except Exception as e:
        print(f"❌ Upload error: {e}")
        print("   Data remains available locally")
        cloud_storage_ready = False
        
else:
    print("⚠️  Skipping GCS upload (not configured)")
    print("   Using local data files")
    cloud_storage_ready = False

# Final summary
print("\n📊 Data Setup Summary:")
print("   Local files: ✅ Created")
print(f"   Google Cloud Storage: {'✅ Ready' if cloud_storage_ready else '❌ Not configured'}")
print(f"   Project ready for: {'☁️ Cloud-native MLOps' if cloud_storage_ready else '💻 Local development'}")

if cloud_storage_ready:
    print("\n🎯 Next steps - all data operations will use Google Cloud:")
    print("   1. 📊 Data processing (notebook 02) → Cloud Storage")
    print("   2. 🤖 Model training (notebook 03) → Vertex AI")
    print("   3. 🚀 Model deployment (notebook 05) → Vertex AI Endpoints")
    print("   4. 🔄 Pipeline orchestration (notebook 06) → Vertex AI Pipelines")
else:
    print("\n📝 To enable cloud features:")
    print("   1. Configure Google Cloud authentication")
    print("   2. Enable required APIs")
    print("   3. Re-run this notebook")

📤 Uploading Iris dataset to Google Cloud Storage...
   📊 Uploading CSV data...
   📊 Uploading NPZ data...
   📋 Uploading metadata...

✅ Dataset uploaded to Google Cloud Storage:
   📊 CSV: gs://mlops-295610-mlops-data-processing/raw-data/iris_dataset.csv
   📦 NPZ: gs://mlops-295610-mlops-data-processing/raw-data/iris_dataset.npz
   📋 Metadata: gs://mlops-295610-mlops-data-processing/raw-data/iris_metadata.pkl

🗂️  Files in Google Cloud Storage:
   📄 raw-data/iris_dataset.csv (0.00 MB)
   📄 raw-data/iris_dataset.npz (0.01 MB)
   📄 raw-data/iris_metadata.pkl (0.00 MB)

🌐 Access your data:
   Console: https://console.cloud.google.com/storage/browser/mlops-295610-mlops-data-processing/raw-data
   CLI: gsutil ls gs://mlops-295610-mlops-data-processing/raw-data/

🔄 Testing data access from GCS...
   ✅ Successfully read CSV from GCS: (150, 6)
   📊 Columns: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'target'
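
Reading the CSV back confirms that one object is readable, but it does not prove that the stored bytes match the local file. Cloud Storage exposes a base64-encoded MD5 digest for non-composite objects, available in the Python client as `blob.md5_hash`, so a local digest in the same encoding can be compared against it. A minimal sketch, where `local_md5_base64` is an illustrative helper name:

```python
import base64
import hashlib
import tempfile
from pathlib import Path


def local_md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a file, the encoding Cloud Storage reports."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


# Self-contained demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"hello")
    print(local_md5_base64(sample))
```

In the upload cell this could be used as `local_md5_base64(csv_path) == csv_blob.md5_hash` once the upload has finished; calling `csv_blob.reload()` first refreshes the object's metadata.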

## 🤖 Section 5: Vertex AI Connectivity Test

Test our connection to Vertex AI and explore available services.

In [9]:
# Test Vertex AI connection and list available resources
print("🤖 Vertex AI Status:")
print(f"  Project: {aiplatform.initializer.global_config.project}")
print(f"  Location: {aiplatform.initializer.global_config.location}")

# List existing resources (will be empty for new projects)
try:
    datasets = aiplatform.TabularDataset.list()
    models = aiplatform.Model.list()
    endpoints = aiplatform.Endpoint.list()
    
    print("\n📊 Current Resources:")
    print(f"  Datasets: {len(datasets)}")
    print(f"  Models: {len(models)}")
    print(f"  Endpoints: {len(endpoints)}")
    
    print("\n✅ Vertex AI connectivity confirmed!")
    
except Exception as e:
    print(f"❌ Vertex AI connection issue: {e}")

🤖 Vertex AI Status:
  Project: mlops-295610
  Location: us-central1

📊 Current Resources:
  Datasets: 0
  Models: 0
  Endpoints: 0

✅ Vertex AI connectivity confirmed!


## 📊 Section 6: Sample Dataset Preparation

Load the Iris dataset from scikit-learn and prepare it for our MLOps pipeline.

In [10]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
import pickle

# Load Iris dataset
print("📊 Loading Iris dataset...")
iris = load_iris()
X, y = iris.data, iris.target

# Create DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
df['target_name'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("\n📈 Dataset Summary:")
print(f"  Shape: {df.shape}")
print(f"  Features: {list(iris.feature_names)}")
print(f"  Classes: {list(iris.target_names)}")
print(f"  \n{df.head()}")

📊 Loading Iris dataset...

📈 Dataset Summary:
  Shape: (150, 6)
  Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
  Classes: [np.str_('setosa'), np.str_('versicolor'), np.str_('virginica')]
  
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target target_name  
0       0      setosa  
1       0      setosa  
2       0      setosa  
3       0      setosa  
4       0      setosa  


In [11]:
# Save dataset locally
data_dir = Path('../data')
data_dir.mkdir(exist_ok=True)

# Save as CSV
csv_path = data_dir / 'iris_dataset.csv'
df.to_csv(csv_path, index=False)
print(f"✅ Saved CSV: {csv_path}")

# Save as NumPy arrays
npz_path = data_dir / 'iris_dataset.npz'
np.savez(npz_path, X=X, y=y, feature_names=iris.feature_names, target_names=iris.target_names)
print(f"✅ Saved NumPy: {npz_path}")

# Save metadata
metadata = {
    'name': 'iris',
    'description': 'Iris flower classification dataset',
    'n_samples': len(X),
    'n_features': X.shape[1],
    'n_classes': len(iris.target_names),
    'feature_names': iris.feature_names,
    'target_names': iris.target_names.tolist()
}

metadata_path = data_dir / 'iris_metadata.pkl'
with open(metadata_path, 'wb') as f:
    pickle.dump(metadata, f)
print(f"✅ Saved metadata: {metadata_path}")

✅ Saved CSV: ../data/iris_dataset.csv
✅ Saved NumPy: ../data/iris_dataset.npz
✅ Saved metadata: ../data/iris_metadata.pkl
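
Pickle files are Python-specific and should not be loaded from untrusted sources. If other tools need the metadata, a JSON copy is one option. The sketch below assumes the same keys as the `metadata` dictionary above and a hypothetical file name `iris_metadata.json`; the literal values reproduce the Iris dataset so the block runs without scikit-learn.

```python
import json
import tempfile
from pathlib import Path

metadata = {
    "name": "iris",
    "description": "Iris flower classification dataset",
    "n_samples": 150,
    "n_features": 4,
    "n_classes": 3,
    "feature_names": ["sepal length (cm)", "sepal width (cm)",
                      "petal length (cm)", "petal width (cm)"],
    "target_names": ["setosa", "versicolor", "virginica"],
}


def write_metadata_json(meta, path):
    """Serialize metadata to JSON; NumPy arrays would need .tolist() first."""
    path = Path(path)
    path.write_text(json.dumps(meta, indent=2))
    return path


# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    out = write_metadata_json(metadata, Path(tmp) / "iris_metadata.json")
    print(json.loads(out.read_text())["target_names"])
```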


In [12]:
# Upload dataset to Cloud Storage
print("☁️ Uploading dataset to GCS...")

# Get a handle to the primary data bucket (BUCKET_NAME is set in Section 4)
bucket = storage_client.bucket(BUCKET_NAME)

# Upload CSV file
blob = bucket.blob('data/iris_dataset.csv')
blob.upload_from_filename(str(csv_path))
print(f"✅ Uploaded: gs://{BUCKET_NAME}/data/iris_dataset.csv")

# Upload NumPy file
blob = bucket.blob('data/iris_dataset.npz')
blob.upload_from_filename(str(npz_path))
print(f"✅ Uploaded: gs://{BUCKET_NAME}/data/iris_dataset.npz")

# Upload metadata
blob = bucket.blob('data/iris_metadata.pkl')
blob.upload_from_filename(str(metadata_path))
print(f"✅ Uploaded: gs://{BUCKET_NAME}/data/iris_metadata.pkl")

print("\n🎉 Dataset preparation complete!")

☁️ Uploading dataset to GCS...
✅ Uploaded: gs://mlops-vertex-ai-bucket-295610/data/iris_dataset.csv
✅ Uploaded: gs://mlops-vertex-ai-bucket-295610/data/iris_dataset.npz
✅ Uploaded: gs://mlops-vertex-ai-bucket-295610/data/iris_metadata.pkl

🎉 Dataset preparation complete!


## ✅ Section 7: Environment Summary

Final validation of our complete environment setup.

In [15]:
# Final Environment Validation Summary
import sys
import os
from pathlib import Path
sys.path.append('../src')

from config import get_config

print("🔍 Final Environment Validation")
print("=" * 50)

# Validate configuration
try:
    config = get_config()
    print("\n📋 Configuration:")
    print(f"  Project ID: {config.gcp.project_id}")
    print(f"  Region: {config.gcp.region}")
    print(f"  Bucket: {config.storage.bucket_name}")
    print("  ✅ Configuration loaded successfully")
except Exception as e:
    print(f"  ⚠️ Configuration error: {e}")

# Check local datasets
data_files = [
    '../data/iris_dataset.csv',
    '../data/iris_dataset.npz', 
    '../data/iris_metadata.pkl'
]

print("\n📊 Local Datasets:")
all_local_files_exist = True
for file_path in data_files:
    exists = Path(file_path).exists()
    status_icon = "✅" if exists else "❌"
    print(f"  {status_icon} {Path(file_path).name}: {exists}")
    if not exists:
        all_local_files_exist = False

# Check GCS bucket contents
print("\n☁️ Cloud Storage:")
try:
    # Ignore .gitkeep placeholders so they do not count as uploaded data
    blobs = [b for b in bucket.list_blobs(prefix='raw-data/')
             if not b.name.endswith('.gitkeep')]
    cloud_files_exist = len(blobs) > 0
    for blob in blobs:
        size_kb = blob.size / 1024
        print(f"  ✅ gs://{BUCKET_NAME}/{blob.name} ({size_kb:.1f} KB)")
    
    if not cloud_files_exist:
        print("  ⚠️ No files found in bucket")
        
except Exception as e:
    print(f"  ❌ Error accessing bucket: {e}")
    cloud_files_exist = False

# Environment validation summary
# The Python, API, authentication, and Vertex AI lines restate earlier sections;
# only the storage and data lines are re-checked in this cell.
storage_ready = 'bucket' in globals()
print("\n🎯 Environment Validation Summary:")
print("  ✅ Python Environment: Working")
print("  ✅ Google Cloud APIs: Enabled")
print("  ✅ Authentication: Configured")
print(f"  {'✅' if storage_ready else '❌'} Cloud Storage: {'Ready' if storage_ready else 'Not configured'}")
print("  ✅ Vertex AI: Connected")
print(f"  {'✅' if all_local_files_exist else '❌'} Local Data: {'Available' if all_local_files_exist else 'Missing files'}")
print(f"  {'✅' if cloud_files_exist else '❌'} Cloud Data: {'Uploaded' if cloud_files_exist else 'Not uploaded'}")

if storage_ready and all_local_files_exist and cloud_files_exist:
    print("\n🎉 Phase 1 Complete! Environment is ready for Phase 2.")
else:
    print("\n⚠️ Phase 1 incomplete: resolve the items marked ❌ above.")
print("\nNext steps:")
print("  📓 Open: 02_data_processing_pipeline.ipynb")
print("  🚀 Learn: Data preprocessing and validation")
print(f"  🔗 Access: https://console.cloud.google.com/vertex-ai?project={PROJECT_ID}")

🔍 Final Environment Validation
2025-11-20 18:36:23,820 - config - INFO - Configuration loaded from /Users/farishussain/GCP_MLOps/notebooks/../configs/config.yaml

📋 Configuration:
  Project ID: mlops-295610
  Region: us-central1
  Bucket: mlops-vertex-ai-bucket-1763645074
  ✅ Configuration loaded successfully

📊 Local Datasets:
  ✅ iris_dataset.csv: True
  ✅ iris_dataset.npz: True
  ✅ iris_metadata.pkl: True

☁️ Cloud Storage:

  ✅ gs://mlops-295610-mlops-data-processing/raw-data/iris_dataset.csv (4.1 KB)
  ✅ gs://mlops-295610-mlops-data-processing/raw-data/iris_dataset.npz (7.2 KB)
  ✅ gs://mlops-295610-mlops-data-processing/raw-data/iris_metadata.pkl (0.3 KB)

🎯 Environment
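
A summary is most useful when each line reflects a check that actually ran. The sketch below builds the report from a dictionary of booleans instead of fixed strings. `build_summary` and the sample values are illustrative; in the notebook the values would come from variables such as `all_local_files_exist` and `cloud_files_exist`.

```python
def build_summary(checks):
    """Return (lines, all_passed) for a mapping of check name to boolean result."""
    lines = [f"  {'✅' if passed else '❌'} {name}" for name, passed in checks.items()]
    return lines, all(checks.values())


# Sample results; one failing check is included to show the incomplete path.
checks = {
    "Python Environment": True,
    "Authentication": True,
    "Local Data": True,
    "Cloud Data": False,
}

lines, all_passed = build_summary(checks)
print("\n".join(lines))
print("Phase 1 complete" if all_passed else "Setup incomplete - see failed checks above")
```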

---

## 🎯 Summary

You have successfully completed **Phase 1: Environment Setup & Foundation**!

### What we accomplished:

✅ **Python Environment** - Validated all required packages  
✅ **Google Cloud Authentication** - Set up project credentials  
✅ **API Access** - Verified Vertex AI and Cloud Storage APIs  
✅ **Cloud Storage** - Created or verified the four project buckets  
✅ **Dataset Preparation** - Loaded the Iris dataset and uploaded it to Cloud Storage  
✅ **Project Configuration** - Set up config files and utilities  

### Next Steps:

Now you're ready to move to **Phase 2: Data Pipeline Implementation**

- Create notebook: `02_data_processing_pipeline.ipynb`
- Implement data preprocessing and validation
- Set up train/test splits
- Create data processing components

---

**🔗 Useful Links:**
- [Google Cloud Console](https://console.cloud.google.com)
- [Vertex AI Console](https://console.cloud.google.com/vertex-ai)
- [Cloud Storage Browser](https://console.cloud.google.com/storage/browser)
- [Project Documentation](../README.md)

## 🌐 Running on Vertex AI Workbench / Colab Enterprise

These notebooks are optimized to run on **Vertex AI Workbench** or **Colab Enterprise** for full cloud-native execution.

### 📚 Import Methods:

#### **Option 1: Vertex AI Workbench (Recommended for Production)**
```bash
# 1. Go to: https://console.cloud.google.com/vertex-ai/workbench
# 2. Click "NEW NOTEBOOK" or "MANAGED NOTEBOOKS"  
# 3. Choose instance type (e.g., n1-standard-4 with GPU if needed)
# 4. Once created, open JupyterLab
# 5. Upload notebooks via drag-and-drop or Git clone:
git clone https://github.com/farishussain/GCP_MLOps.git
cd GCP_MLOps/notebooks
```

#### **Option 2: Colab Enterprise (Great for Collaboration)**
```bash
# 1. Go to: https://colab.research.google.com/
# 2. Select "Upload" → "GitHub"
# 3. Enter: farishussain/GCP_MLOps
# 4. Choose any notebook to start
# 5. Or use direct links:
#    - Getting Started: [Direct Colab Link]
#    - Data Processing: [Direct Colab Link] 
#    - Model Training: [Direct Colab Link]
```

#### **Option 3: Direct Upload to Existing Instance**
```bash
# If you already have a Vertex AI Workbench instance:
# 1. Download notebooks from GitHub
# 2. Upload via JupyterLab interface
# 3. Or use terminal in Workbench:
git clone https://github.com/farishussain/GCP_MLOps.git
cd GCP_MLOps && ls -la notebooks/
```

### ⚙️ Cloud Environment Setup:

When running on Vertex AI, the notebooks will automatically:
- ✅ Use pre-installed Google Cloud libraries
- ✅ Inherit authentication from the compute instance  
- ✅ Have direct access to Vertex AI services
- ✅ Connect to your project's resources seamlessly

### 🔧 Required Permissions:
Ensure your Workbench service account has:
- **Vertex AI User** role
- **Storage Admin** role  
- **BigQuery Admin** role
- **Service Account User** role

### 💡 Pro Tips for Cloud Execution:
- Start with a **small instance** (n1-standard-2) for development
- **Scale up** to GPU instances for intensive training
- Use **Vertex AI managed notebooks** for automatic updates
- **Save work frequently** - instances can be preempted

---

In [14]:
# 🌐 Automatic Cloud Environment Detection & Setup
print("🌐 Detecting execution environment...")

import os
import sys
import subprocess
import json
from pathlib import Path

def detect_environment():
    """Detect if running on Vertex AI Workbench, Colab, or local"""
    
    # Check for Google Colab
    try:
        import google.colab
        return "colab"
    except ImportError:
        pass
    
    # Check for Vertex AI Workbench (several indicators)
    vertex_indicators = [
        os.path.exists("/opt/conda"),  # Conda environment
        os.path.exists("/opt/nvidia"),  # NVIDIA drivers
        "jupyter" in os.environ.get("_", ""),  # Jupyter environment
        os.environ.get("DLT_DOCKER_IMAGE"),  # Deep Learning container
        os.environ.get("WORKBENCH_NAME")  # Workbench specific
    ]
    
    if any(vertex_indicators):
        return "vertex_workbench"
    
    # Check for generic cloud environment
    cloud_indicators = [
        os.environ.get("GOOGLE_CLOUD_PROJECT"),
        os.environ.get("GCLOUD_PROJECT"),
        os.path.exists("/usr/bin/gcloud")
    ]
    
    if any(cloud_indicators):
        return "cloud_shell_or_gce"
    
    return "local"

# Detect environment
environment = detect_environment()
print(f"   Environment detected: {environment.upper()}")

# Auto-setup based on environment
if environment == "colab":
    print("   🔧 Setting up Google Colab environment...")
    
    # Install additional packages if needed.
    # check=True makes pip failures raise, so they are reported instead of ignored.
    try:
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "-q",
             "google-cloud-aiplatform", "google-cloud-storage", "kfp"],
            check=True,
        )
        print("      ✅ Installed cloud packages")
    except subprocess.CalledProcessError as e:
        print(f"      ⚠️ Package installation failed: {e}")
    
    # Authenticate (will prompt for auth code)
    from google.colab import auth
    auth.authenticate_user()
    print("      ✅ Authentication complete")
    
elif environment == "vertex_workbench":
    print("   🔧 Setting up Vertex AI Workbench environment...")
    
    # Workbench comes pre-configured, but verify setup
    try:
        import google.auth
        credentials, project_id = google.auth.default()
        print(f"      ✅ Auto-authenticated with project: {project_id}")
        
        # Set environment variables (project_id is None when no default project is configured)
        if project_id:
            os.environ['GOOGLE_CLOUD_PROJECT'] = project_id
        
    except Exception as e:
        print(f"      ⚠️ Authentication issue: {e}")
    else:
        # Report readiness only when authentication succeeded
        print("      ✅ Vertex AI Workbench ready")
    
elif environment == "cloud_shell_or_gce":
    print("   🔧 Setting up Cloud Shell/GCE environment...")
    
    try:
        # Get project from metadata server
        result = subprocess.run([
            'curl', '-s', '-H', 'Metadata-Flavor: Google',
            'http://metadata.google.internal/computeMetadata/v1/project/project-id'
        ], capture_output=True, text=True, timeout=5)
        
        if result.returncode == 0:
            project_id = result.stdout.strip()
            os.environ['GOOGLE_CLOUD_PROJECT'] = project_id
            print(f"      ✅ Auto-detected project: {project_id}")
        
    except Exception:
        print("      ⚠️ Could not auto-detect project")
    
else:
    print("   💻 Local environment detected")
    print("      📝 Follow the GCP setup steps above for cloud features")

# Auto-install missing packages for cloud environments
if environment in ["colab", "vertex_workbench", "cloud_shell_or_gce"]:
    print("\n📦 Verifying cloud packages...")
    
    required_cloud_packages = [
        "google-cloud-aiplatform",
        "google-cloud-storage",
        "google-cloud-bigquery",
        "kfp"
    ]
    
    missing_packages = []
    for package in required_cloud_packages:
        try:
            __import__(package.replace('-', '.'))
        except ImportError:
            missing_packages.append(package)
    
    if missing_packages:
        print(f"   🔄 Installing missing packages: {missing_packages}")
        for package in missing_packages:
            try:
                subprocess.run([sys.executable, "-m", "pip", "install", package],
                               check=True, capture_output=True)
                print(f"      ✅ Installed {package}")
            except subprocess.CalledProcessError as e:
                print(f"      ❌ Failed to install {package} (exit code {e.returncode})")
    else:
        print("   ✅ All cloud packages available")

# Set global configuration
CLOUD_ENVIRONMENT = environment
IS_CLOUD_EXECUTION = environment in ["colab", "vertex_workbench", "cloud_shell_or_gce"]

print("\n🎯 Environment Setup Complete!")
print(f"   Cloud Execution: {'✅ YES' if IS_CLOUD_EXECUTION else '❌ NO'}")
print(f"   Authentication: {'✅ Auto' if IS_CLOUD_EXECUTION else '🔧 Manual Required'}")

if IS_CLOUD_EXECUTION:
    print("   💡 All notebooks will run with full cloud integration!")
else:
    print("   💡 Some cloud features may require additional setup")

# Store environment info for other notebooks
cloud_config = {
    'environment': CLOUD_ENVIRONMENT,
    'is_cloud': IS_CLOUD_EXECUTION,
    'auto_auth': IS_CLOUD_EXECUTION,
    'project_auto_detected': os.environ.get('GOOGLE_CLOUD_PROJECT') is not None
}

print("\n✅ Cloud environment configuration stored for all notebooks")

🌐 Detecting execution environment...
   Environment detected: CLOUD_SHELL_OR_GCE
   🔧 Setting up Cloud Shell/GCE environment...

📦 Verifying cloud packages...
   ✅ All cloud packages available

🎯 Environment Setup Complete!
   Cloud Execution: ✅ YES
   Authentication: ✅ Auto
   💡 All notebooks will run with full cloud integration!

✅ Cloud environment configuration stored for all notebooks
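
The `cloud_config` dictionary built in the cell above lives only in this kernel's memory, so other notebooks cannot read it unless it is written somewhere. One possible approach is sketched below, under the assumption that a JSON file is an acceptable hand-off format. The file name `cloud_config.json` and the helpers `save_cloud_config` and `load_cloud_config` are hypothetical and not part of the project.

```python
import json
import tempfile
from pathlib import Path


def save_cloud_config(config: dict, path: Path) -> Path:
    """Write the configuration dictionary to `path` as JSON."""
    path.write_text(json.dumps(config, indent=2))
    return path


def load_cloud_config(path: Path) -> dict:
    """Read a configuration dictionary written by save_cloud_config."""
    return json.loads(path.read_text())


# Example values mirroring the keys of cloud_config in the cell above.
example_config = {
    "environment": "local",
    "is_cloud": False,
    "auto_auth": False,
    "project_auto_detected": False,
}

# A temporary directory keeps the demo from touching the project tree.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = save_cloud_config(example_config, Path(tmp_dir) / "cloud_config.json")
    restored = load_cloud_config(config_path)

print(restored == example_config)  # → True
```

In practice the file would live at a path every notebook agrees on (for example, next to the notebooks), and later notebooks would call the loader at the top of their first cell.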

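The package check in the cell above derives each import name by replacing `-` with `.` in the distribution name. That works for the four packages listed there, but not in general, since many distributions use an unrelated import name. A hedged alternative is to keep an explicit mapping and probe it with `importlib.util.find_spec`, which locates the target module without executing it (parent packages such as `google.cloud` are still imported). The mapping and the helper names `is_importable` and `find_missing` are illustrative assumptions.

```python
import importlib.util

# Distribution name -> import name (explicit, so no string guessing is needed).
CLOUD_PACKAGES = {
    "google-cloud-aiplatform": "google.cloud.aiplatform",
    "google-cloud-storage": "google.cloud.storage",
    "google-cloud-bigquery": "google.cloud.bigquery",
    "kfp": "kfp",
}


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ImportError: a parent package (e.g. `google`) is itself missing.
        # ValueError: the module is loaded but has no usable __spec__.
        return False


def find_missing(packages: dict) -> list:
    """Return the distribution names whose import name cannot be located."""
    return [dist for dist, module in packages.items() if not is_importable(module)]


# The result depends on what is installed in the current environment.
print(find_missing(CLOUD_PACKAGES))
```

The returned list can be passed to the same `pip install` loop used in the cell above.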