# Enterprise ML Pipeline on GCP - Demo Notebook

This notebook walks through the Enterprise ML Pipeline built on Google Cloud Platform (GCP). It covers the following steps:

1. Setting up the environment
2. Generating and uploading synthetic data to BigQuery
3. Data validation and preprocessing
4. Running the ML pipeline locally
5. Model evaluation
6. Making predictions with the deployed model

Let's get started!

## 1. Setup and Configuration

First, let's set up our environment and import the necessary libraries.

In [None]:
# Import required libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from google.cloud import bigquery
from google.cloud import aiplatform

# Add the project root to the path
sys.path.append('..')

# Import project modules
from src.data.bigquery_utils import BigQueryClient
from src.data.data_validation import DataValidator
from src.pipeline.config import PipelineConfig
from src.models.evaluation import ModelEvaluator
from src.serving.prediction import PredictionService

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Display TensorFlow version
print(f'TensorFlow version: {tf.__version__}')

### Configure GCP Project

Set your Google Cloud Platform project ID and region.

In [None]:
# Set your GCP project ID and region
PROJECT_ID = 'your-gcp-project-id'  # Replace with your project ID
REGION = 'us-central1'  # Replace with your preferred region

# Set environment variables
os.environ['GOOGLE_CLOUD_PROJECT'] = PROJECT_ID
os.environ['GOOGLE_CLOUD_REGION'] = REGION

# Initialize Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=REGION)

# Create pipeline config
config = PipelineConfig(
    project_id=PROJECT_ID,
    region=REGION,
    pipeline_name='retail-sales-pipeline'
)

print(f'Project ID: {PROJECT_ID}')
print(f'Region: {REGION}')
print(f'Pipeline root: {config.pipeline_root}')
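
The internals of `PipelineConfig` are not shown in this notebook. As a rough mental model, here is a minimal sketch of a config object that derives `pipeline_root` and `serving_model_dir` from the pipeline name. The class name `PipelineConfigSketch`, the `base_dir` field, and the directory layout are all assumptions for illustration; the real class in `src/pipeline/config.py` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfigSketch:
    """Hypothetical stand-in for src.pipeline.config.PipelineConfig."""

    project_id: str
    region: str
    pipeline_name: str
    base_dir: str = "artifacts"  # Assumed local artifact folder, not from the project.

    @property
    def pipeline_root(self) -> str:
        # Folder where pipeline artifacts would be written.
        return f"{self.base_dir}/pipeline_root/{self.pipeline_name}"

    @property
    def serving_model_dir(self) -> str:
        # Folder where exported models would be pushed.
        return f"{self.base_dir}/serving_model/{self.pipeline_name}"


sketch_config = PipelineConfigSketch(
    project_id="your-gcp-project-id",
    region="us-central1",
    pipeline_name="retail-sales-pipeline",
)
print(sketch_config.pipeline_root)
```

Deriving both paths from one name keeps every artifact of a pipeline under a predictable prefix.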

## 2. Generate and Upload Synthetic Data

Let's generate synthetic retail data and upload it to BigQuery.

In [None]:
from src.data.upload_to_bigquery import generate_retail_data

# Generate synthetic data
num_samples = 10000
df = generate_retail_data(num_samples)

# Display the first few rows
print(f'Generated {len(df)} synthetic retail records')
df.head()
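
If you are curious what a generator like `generate_retail_data` might do, the sketch below builds a small synthetic table with NumPy. It is not the project's implementation: the function name, value ranges, category values, and the formula for `total_amount` (quantity times unit price, less the discount) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def generate_retail_data_sketch(num_samples: int, seed: int = 42) -> pd.DataFrame:
    """Hypothetical generator; the real generate_retail_data may differ."""
    rng = np.random.default_rng(seed)  # Seeded for reproducible output.
    quantity = rng.integers(1, 11, size=num_samples)  # 1 to 10 items
    unit_price = rng.uniform(5.0, 200.0, size=num_samples).round(2)
    discount = rng.choice([0.0, 0.05, 0.1, 0.2], size=num_samples)
    # Assumed target: discounted basket value, with no noise term.
    total_amount = (quantity * unit_price * (1.0 - discount)).round(2)
    return pd.DataFrame({
        "quantity": quantity,
        "unit_price": unit_price,
        "discount": discount,
        "customer_age": rng.integers(18, 80, size=num_samples),
        "transaction_hour": rng.integers(0, 24, size=num_samples),
        "store_id": rng.choice(["STORE_01", "STORE_02", "STORE_03"], size=num_samples),
        "payment_method": rng.choice(["Credit Card", "Cash", "Debit Card"], size=num_samples),
        "total_amount": total_amount,
    })


sketch_df = generate_retail_data_sketch(1000)
print(sketch_df.head())
```

Fixing the seed makes repeated runs produce the same table, which helps when comparing pipeline runs.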

In [None]:
# Explore data statistics
df.describe()

In [None]:
# Visualize the distribution of total_amount (target variable)
plt.figure(figsize=(10, 6))
sns.histplot(df['total_amount'], kde=True)
plt.title('Distribution of Total Amount')
plt.xlabel('Total Amount')
plt.ylabel('Count')
plt.show()

# Visualize relationship between quantity and total_amount
plt.figure(figsize=(10, 6))
sns.scatterplot(x='quantity', y='total_amount', data=df, alpha=0.5)
plt.title('Quantity vs Total Amount')
plt.xlabel('Quantity')
plt.ylabel('Total Amount')
plt.show()

In [None]:
# Initialize BigQuery client
bq_client = BigQueryClient(project_id=PROJECT_ID)

# Create dataset if it doesn't exist
dataset_id = 'retail_dataset'
bq_client.create_dataset(dataset_id)

# Upload data to BigQuery
table_id = 'retail_sales'
bq_client.upload_dataframe_to_table(df, dataset_id, table_id)

print(f'Data uploaded to {PROJECT_ID}.{dataset_id}.{table_id}')
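
`BigQueryClient` is a project wrapper whose code is not shown here. One plausible way to implement the same two steps is with the `create_dataset` and `load_table_from_dataframe` methods of `google.cloud.bigquery.Client`. The sketch below assumes that approach; the helper name `upload_frame` is hypothetical.

```python
def upload_frame(client, df, project_id: str, dataset_id: str, table_id: str) -> str:
    """Create the dataset if needed, then load a DataFrame into a table.

    `client` is expected to be a google.cloud.bigquery.Client instance.
    """
    dataset_ref = f"{project_id}.{dataset_id}"
    # exists_ok=True avoids an error when the dataset already exists.
    client.create_dataset(dataset_ref, exists_ok=True)

    table_ref = f"{dataset_ref}.{table_id}"
    load_job = client.load_table_from_dataframe(df, table_ref)
    load_job.result()  # Wait for the load job; raises if it failed.
    return table_ref


# Usage with real credentials (not executed here):
# from google.cloud import bigquery
# upload_frame(bigquery.Client(project=PROJECT_ID), df,
#              PROJECT_ID, 'retail_dataset', 'retail_sales')
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no GCP credentials are available.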

## 3. Data Validation and Preprocessing

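Let's validate the data using TensorFlow Data Validation (TFDV): we compute summary statistics, infer a schema from them, and then check that the training and evaluation splits look alike.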

In [None]:
# Initialize data validator
data_validator = DataValidator()

# Generate statistics for the dataset
stats = data_validator.generate_statistics(df)

# Infer schema from statistics
schema = data_validator.infer_schema(stats)

# Display statistics and schema
data_validator.display_statistics(stats)
data_validator.display_schema(schema)
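
To make the idea of "inferring a schema" concrete, here is a deliberately simplified stand-in written with pandas only. It is not TFDV and not the project's `DataValidator`; the function names `infer_simple_schema` and `find_anomalies` are hypothetical. It records each column's dtype, whether missing values were seen, and the set of values for low-cardinality text columns, then checks another frame against that record.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_simple_schema(df: pd.DataFrame, max_categories: int = 20) -> dict:
    """Record dtype, nullability, and small value domains per column."""
    schema = {}
    for name in df.columns:
        col = df[name]
        entry = {"dtype": str(col.dtype), "nullable": bool(col.isna().any())}
        # Treat low-cardinality non-numeric columns as categorical domains.
        if not is_numeric_dtype(col) and col.nunique() <= max_categories:
            entry["domain"] = sorted(col.dropna().unique().tolist())
        schema[name] = entry
    return schema


def find_anomalies(df: pd.DataFrame, schema: dict) -> list:
    """Return human-readable deviations of `df` from `schema`."""
    problems = []
    for name, entry in schema.items():
        if name not in df.columns:
            problems.append(f"{name}: column missing")
            continue
        col = df[name]
        if str(col.dtype) != entry["dtype"]:
            problems.append(f"{name}: dtype {col.dtype}, expected {entry['dtype']}")
        if not entry["nullable"] and col.isna().any():
            problems.append(f"{name}: unexpected missing values")
        if "domain" in entry:
            unseen = set(col.dropna().unique().tolist()) - set(entry["domain"])
            if unseen:
                problems.append(f"{name}: unseen values {sorted(unseen)}")
    return problems


# Example with the notebook's frames (after the split below):
# anomalies = find_anomalies(eval_df, infer_simple_schema(train_df))
```

TFDV's schema carries more than this (value ranges, presence constraints, feature shapes), so treat the sketch as an illustration of the concept only.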

In [None]:
# Split data into training and evaluation sets
from sklearn.model_selection import train_test_split

train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)

print(f'Training set size: {len(train_df)}')
print(f'Evaluation set size: {len(eval_df)}')

# Generate statistics for training and evaluation sets
train_stats = data_validator.generate_statistics(train_df)
eval_stats = data_validator.generate_statistics(eval_df)

# Compare statistics
data_validator.compare_statistics(train_stats, eval_stats, schema)
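
Comparing the statistics of two splits comes down to asking whether each feature is distributed similarly in both. The sketch below is a simplified pandas stand-in, not the method TFDV uses: it flags numeric columns whose mean differs by more than a chosen fraction of the training standard deviation. The function name and the default `threshold` of 0.25 are assumptions.

```python
import pandas as pd


def compare_numeric_means(train: pd.DataFrame, other: pd.DataFrame,
                          threshold: float = 0.25) -> pd.DataFrame:
    """Flag numeric columns whose mean shifts by more than `threshold` train std-devs."""
    numeric_cols = train.select_dtypes("number").columns
    report = pd.DataFrame({
        "train_mean": train[numeric_cols].mean(),
        "other_mean": other[numeric_cols].mean(),
        "train_std": train[numeric_cols].std(),
    })
    # A zero standard deviation would divide by zero, so map it to NaN.
    std = report["train_std"].replace(0.0, float("nan"))
    report["shift"] = (report["other_mean"] - report["train_mean"]).abs() / std
    # Comparisons against NaN are False, so constant columns are never flagged.
    report["flagged"] = report["shift"] > threshold
    return report


# Example with the notebook's frames:
# print(compare_numeric_means(train_df, eval_df))
```

With a random 80/20 split of the same data, you would normally expect no column to be flagged.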

## 4. Running the ML Pipeline Locally

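Now, let's run the ML pipeline locally using TFX. We first write the training and evaluation splits to CSV files so the pipeline can read them from the data directory.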

In [None]:
# Save training and evaluation data to CSV for the pipeline
data_dir = os.path.join('..', 'data')
os.makedirs(data_dir, exist_ok=True)

train_path = os.path.join(data_dir, 'train.csv')
eval_path = os.path.join(data_dir, 'eval.csv')

train_df.to_csv(train_path, index=False)
eval_df.to_csv(eval_path, index=False)

print(f'Training data saved to {train_path}')
print(f'Evaluation data saved to {eval_path}')

In [None]:
# Import the pipeline runner
from src.pipeline.run_pipeline import create_pipeline, run_pipeline

# Run the pipeline locally.
# Note: the call below is commented out because a full run may take some time.
# Uncomment the following lines to run the pipeline from this cell.

# run_pipeline(
#     config=config,
#     data_path=data_dir,
#     mode='local',
#     enable_cache=True
# )

print('To run the pipeline, execute the following command in the terminal:')
print(f'python ../src/pipeline/run_pipeline.py --project_id={PROJECT_ID} --region={REGION} --data_path={data_dir} --mode=local')
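
If you prefer to launch the script from Python instead of copying the command by hand, one option is to build the argument list explicitly. This is a sketch that reuses the flags from the printed command above; the helper name `build_pipeline_command` is hypothetical.

```python
import shlex
import sys


def build_pipeline_command(project_id: str, region: str, data_path: str,
                           mode: str = "local") -> list:
    """Build the argv list for run_pipeline.py from the flags shown above."""
    return [
        sys.executable,  # Same interpreter that runs this notebook.
        "../src/pipeline/run_pipeline.py",
        f"--project_id={project_id}",
        f"--region={region}",
        f"--data_path={data_path}",
        f"--mode={mode}",
    ]


cmd = build_pipeline_command("your-gcp-project-id", "us-central1", "../data")
print(shlex.join(cmd))  # Shell-quoted form, safe to paste into a terminal.

# To launch it from Python instead:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems if a path contains spaces.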

## 5. Model Evaluation

Let's evaluate the trained model. This step assumes the pipeline from section 4 has already run and exported a model.

In [None]:
# Load the trained model
model_path = os.path.join(config.serving_model_dir, 'latest')

# Check if model exists
if os.path.exists(model_path):
    model = tf.keras.models.load_model(model_path)
    print(f'Model loaded from {model_path}')
    
    # Display model summary
    model.summary()
else:
    print(f'Model not found at {model_path}. Please run the pipeline first.')
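
The cell above expects a folder named `latest` under the serving directory. Model export tools in the TensorFlow ecosystem often write each export to a numbered version folder instead, which is the layout TensorFlow Serving expects. If your serving directory looks like that (an assumption about your setup), a small helper such as the hypothetical `latest_version_dir` below can pick the newest export.

```python
import os
from typing import Optional


def latest_version_dir(serving_dir: str) -> Optional[str]:
    """Return the numerically largest version subdirectory, or None."""
    if not os.path.isdir(serving_dir):
        return None
    versions = [
        name for name in os.listdir(serving_dir)
        if name.isdigit() and os.path.isdir(os.path.join(serving_dir, name))
    ]
    if not versions:
        return None
    # Compare as integers so that "10" sorts after "2".
    return os.path.join(serving_dir, max(versions, key=int))


# Example:
# model_path = latest_version_dir(config.serving_model_dir)
```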

In [None]:
# If model exists, evaluate it on the evaluation dataset
if 'model' in locals():
    # Prepare evaluation data
    # Note: This is a simplified evaluation. In practice, you would need to preprocess the data
    # using the same transformations applied during training.
    
    # Initialize model evaluator
    evaluator = ModelEvaluator()
    
    # Generate evaluation report
    # This is a placeholder - in a real scenario, you would use the proper transformed data
    print('To evaluate the model properly, you should use the TFX Evaluator component outputs.')
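
Since `total_amount` is a continuous target, typical evaluation metrics are mean absolute error (MAE) and root mean squared error (RMSE). The sketch below computes both from plain lists of true and predicted values. It is independent of `ModelEvaluator` and the TFX Evaluator, whose outputs are not shown here; the function name is hypothetical.

```python
import math
from typing import Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> dict:
    """Compute MAE and RMSE for paired true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, so a big gap between the two suggests a few badly mispredicted transactions.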

## 6. Making Predictions with the Deployed Model

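Let's use the model to make predictions. The cell below loads the locally exported model through `PredictionService`; if no local model is found, it points to the Cloud Run deployment instead.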

In [None]:
# Create sample input data
sample_input = {
    'quantity': 5,
    'unit_price': 50.0,
    'discount': 0.1,
    'customer_age': 35,
    'transaction_hour': 14,
    'customer_id': 'CUST_1001',
    'product_id': 'PROD_5432',
    'customer_gender': 'M',
    'store_id': 'STORE_01',
    'payment_method': 'Credit Card',
    'customer_segment': 'Regular',
    'transaction_day': 'Monday',
    'transaction_month': 'January'
}

# If model exists locally, use PredictionService
if os.path.exists(model_path):
    prediction_service = PredictionService(model_path)
    
    # Make prediction
    try:
        prediction = prediction_service.predict_single(sample_input)
        print(f'Predicted total amount: ${prediction:.2f}')
    except Exception as e:
        print(f'Error making prediction: {e}')
else:
    print('Model not available locally. You can deploy it to Cloud Run for online predictions.')
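
Before sending an input to the model, it can help to check that it has the expected features and value types. The sketch below does this for the feature names used in `sample_input`. Splitting them into numeric and categorical sets is inferred from the sample values, and the helper name `validate_instance` is hypothetical; `PredictionService` may already perform its own checks.

```python
NUMERIC_FEATURES = {
    "quantity", "unit_price", "discount", "customer_age", "transaction_hour",
}
CATEGORICAL_FEATURES = {
    "customer_id", "product_id", "customer_gender", "store_id",
    "payment_method", "customer_segment", "transaction_day", "transaction_month",
}


def validate_instance(instance: dict) -> list:
    """Return a list of problems found in one prediction input."""
    problems = []
    expected = NUMERIC_FEATURES | CATEGORICAL_FEATURES
    provided = set(instance)
    problems += [f"missing feature: {name}" for name in sorted(expected - provided)]
    problems += [f"unexpected feature: {name}" for name in sorted(provided - expected)]
    for name in sorted(NUMERIC_FEATURES & provided):
        value = instance[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a number, got {type(value).__name__}")
    for name in sorted(CATEGORICAL_FEATURES & provided):
        if not isinstance(instance[name], str):
            problems.append(f"{name}: expected a string, got {type(instance[name]).__name__}")
    return problems


# Example with the notebook's input:
# print(validate_instance(sample_input))  # An empty list means no problems.
```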

### Deploying the Model to Cloud Run

To deploy the model to Cloud Run for online predictions, you can use the deployment script `scripts/deploy_model.sh`, which takes the project ID, region, and model path as arguments.

In [None]:
# Display the command to deploy the model
print('To deploy the model to Cloud Run, execute the following command in the terminal:')
print(f'bash ../scripts/deploy_model.sh {PROJECT_ID} {REGION} {model_path}')
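
Once the service is deployed, online predictions are made over HTTPS. The request format depends on how the serving container is written, which this notebook does not show. The sketch below therefore makes several assumptions: a `/predict` route, a JSON body of the form `{"instances": [...]}`, and an optional identity token sent as a Bearer header for services that require authentication. Adjust these to match your deployment.

```python
import json
import urllib.request


def build_prediction_request(service_url: str, instance: dict,
                             token: str = "") -> urllib.request.Request:
    """Build a POST request for one prediction (route and body are assumed)."""
    body = json.dumps({"instances": [instance]}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Cloud Run services that require authentication expect a Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    url = f"{service_url.rstrip('/')}/predict"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Sending the request (not executed here; replace the URL with your service URL):
# request = build_prediction_request("https://your-service-url", sample_input)
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.loads(response.read()))
```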

## Conclusion

In this notebook, we demonstrated the key components of our Enterprise ML Pipeline on GCP:

1. Setting up the environment and configuring GCP resources
2. Generating and uploading synthetic retail data to BigQuery
3. Validating and preprocessing the data using TensorFlow Data Validation
4. Running the TFX pipeline locally
5. Evaluating the trained model
6. Making predictions with the deployed model

This pipeline demonstrates an end-to-end approach to machine learning workflows on GCP, covering data validation, model training, evaluation, and deployment.