# Office cloud task for Google Cloud ML Setup and Deployment

This notebook sets up the necessary Google Cloud Platform (GCP) services for machine learning deployment. It configures:

- **Google Cloud Storage (GCS)** - For storing model artifacts and data
- **Google Cloud AI Platform** - For model training and deployment
- **BigQuery** - For data storage and analysis
- **Project configuration** - Sets up the GCP project and region

The setup process includes creating a storage bucket and initializing the AI Platform with the necessary credentials and configurations.

In [1]:
# Import required libraries for Google Cloud ML services
# These libraries provide access to GCP's machine learning and data services

from google.cloud import aiplatform  # For model training and deployment
from google.cloud import storage     # For cloud storage operations
from google.cloud import bigquery    # For data warehouse operations

import pandas as pd                  # For data manipulation

In [2]:
# Configure GCP project settings
# This cell sets up the basic configuration for your GCP project

# Get the current GCP project ID from gcloud CLI
project = !gcloud config get-value project
PROJECT_ID = project[0]

# Set the region for AI Platform services (us-central1 is cost-effective)
LOCATION = 'us-central1'

# Define the storage bucket name for storing model artifacts
BUCKET = 'cloud-office-ml-bucket'

In [3]:
# Initialize Google Cloud service clients
# These clients will be used to interact with GCS and BigQuery services

gcs = storage.Client(project = PROJECT_ID)  # Google Cloud Storage client
bq = bigquery.Client(project = PROJECT_ID)  # BigQuery client for data operations

In [4]:
# Create or verify Google Cloud Storage bucket
# This bucket will store model artifacts, training data, and other ML assets

if not gcs.lookup_bucket(BUCKET):
    # Create new bucket if it doesn't exist
    bucketDef = gcs.bucket(BUCKET)
    bucket = gcs.create_bucket(bucketDef, project=PROJECT_ID, location=LOCATION)
    print(f'Created Bucket: {gcs.lookup_bucket(BUCKET).name}')
else:
    # Use existing bucket if it already exists
    bucket = gcs.bucket(BUCKET)
    print(f'Bucket already exist: {bucket.name}')

Bucket already exist: cloud-office-ml-bucket


In [5]:
# Create bucket URI for AI Platform configuration
# This URI format is required by Google Cloud AI Platform services

BUCKET_URI = f"gs://{bucket.name}"

In [6]:
# Initialize Google Cloud AI Platform
# This sets up the AI Platform with your project settings and staging bucket
# The staging bucket is where training artifacts and model files will be stored

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

In [7]:
MODEL_ARTIFACT_DIR = "coml-artifact-dir"
REPOSITORY = "coml-repository-name"
IMAGE = "coml-image-name"
MODEL_DISPLAY_NAME = "coml-model-display-name"

# Set the defaults if no names were specified
if MODEL_ARTIFACT_DIR == "[your-artifact-directory]":
    MODEL_ARTIFACT_DIR = "custom-container-prediction-model"

if REPOSITORY == "[your-repository-name]":
    REPOSITORY = "custom-container-prediction"

if IMAGE == "[your-image-name]":
    IMAGE = "sklearn-fastapi-server"

if MODEL_DISPLAY_NAME == "[your-model-display-name]":
    MODEL_DISPLAY_NAME = "sklearn-custom-container"

In [8]:
%mkdir app

mkdir: cannot create directory ‘app’: File exists


In [9]:
%%writefile app/preprocess.py

import pandas as pd

class Preprocessor():
    def __init__(self):
        self.numerical = ['tenure', 'monthlycharges', 'totalcharges']
        self.categorical = [
            'gender',
            'seniorcitizen',
            'partner',
            'dependents',
            'phoneservice',
            'multiplelines',
            'internetservice',
            'onlinesecurity',
            'onlinebackup',
            'deviceprotection',
            'techsupport',
            'streamingtv',
            'streamingmovies',
            'contract',
            'paperlessbilling',
            'paymentmethod',
        ]

    def preprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocess the raw dataframe.
        Args:
            df_raw: Raw dataframe to preprocess.

        Returns:
            Preprocessed dataframe.
        """

        df.columns = [col.lower().replace(' ', '_') for col in df.columns]
        df = df[self.categorical + self.numerical + ['churn']]
        df.churn = (df.churn == 'Yes').astype(int)

        for col in df.columns:
            if df[col].dtype == 'object':
                df[col] = df[col].str.lower().str.replace(' ', '_')
                
        df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
        df[self.numerical] = df[self.numerical].fillna(0)

        return df

Overwriting app/preprocess.py


In [12]:
%cd notebooks

/home/ev/cloud-office-ml/notebooks


In [14]:
import pickle

import joblib
from app.preprocess import Preprocessor
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

dv = DictVectorizer()
prp = Preprocessor()

BUCKET = 'cloud-office-ml-bucket'

df_raw = pd.read_csv(f'gs://{BUCKET}/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df_processed = prp.preprocess(df_raw)

y_train = df_processed['churn']
X_train = df_processed.drop('churn', axis=1)

train_dict = X_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

joblib.dump(model, "model.joblib")
with open("preprocessor.pkl", "wb") as f:
    pickle.dump(prp, f)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.churn = (df.churn == 'Yes').astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = df[col].str.lower().str.replace(' ', '_')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
A value is trying to be set on a copy of a s