# Deploying a Scikit-learn Pipeline for Online Inference with Vertex AI

### Introduction

In this notebook, we will walk through the complete, end-to-end process of deploying a machine learning model on Google Cloud for real-time predictions.

Our goal is to take a logistic regression model, designed to predict the sentiment of customer reviews, and make it available as a live, scalable service using Vertex AI. This is a foundational workflow in modern Machine Learning Operations (MLOps).

### Workflow Overview

We will follow these key steps, simulating a professional MLOps lifecycle:

1.  **Setup & Data Preparation:** Load the dataset, clean it, and split it for training and testing.
2.  **Model Training with a Pipeline:** Build, train, and evaluate a scikit-learn `Pipeline` that combines our text vectorizer and classifier into a single, robust object.
3.  **Saving the Model Artifact:** Save the validated pipeline to a single `model.joblib` file.
4.  **Deploying to Vertex AI:**
    * Upload the model artifact to Google Cloud Storage.
    * Import the model into the Vertex AI Model Registry.
    * Deploy the registered model to a live Vertex AI Endpoint.
5.  **Getting Online Inferences:** Use the Vertex AI Python SDK to send a new review to our deployed endpoint and receive a real-time sentiment prediction.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
file_path = 'womens_clothing_ecommerce_reviews.csv'
df = pd.read_csv(file_path)
print("✅ Successfully loaded the dataset.")
print("Dataset preview:")
print(df.head())

✅ Successfully loaded the dataset.
Dataset preview:
                                         Review Text  sentiment
0  Absolutely wonderful - silky and sexy and comf...          1
1  Love this dress!  it's sooo pretty.  i happene...          1
2  I love, love, love this jumpsuit. it's fun, fl...          1
3  This shirt is very flattering to all due to th...          1
4  I love tracy reese dresses, but this one is no...         -1


In [3]:
# Get some basic information about the dataset
print("\nDataset Information:")
df.info()


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19818 entries, 0 to 19817
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Review Text  19818 non-null  object
 1   sentiment    19818 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 309.8+ KB


In [4]:
# --- Split the data into training and testing sets ---
# The model must be evaluated on data it has never seen during training.
# We use 80% of the data for training and hold out 20% for testing.
X = df['Review Text']
y = df['sentiment']

# 'stratify=y' keeps the proportion of positive and negative reviews the same in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\nData split into {len(X_train)} training samples and {len(X_test)} testing samples.")


Data split into 15854 training samples and 3964 testing samples.
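
One way to confirm that `stratify=y` did its job is to compare the label proportions of the two splits. The sketch below is a minimal, standard-library illustration: `class_proportions` is a hypothetical helper (not part of scikit-learn), and the toy label lists are invented stand-ins for `y_train` and `y_test`.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: fraction of samples} for any iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Invented toy stand-ins for y_train / y_test (80% positive in both).
toy_train = [1] * 8 + [-1] * 2
toy_test = [1] * 4 + [-1] * 1

print(class_proportions(toy_train))
print(class_proportions(toy_test))
```

In the notebook, the same helper could be called as `class_proportions(y_train.tolist())` and `class_proportions(y_test.tolist())`; with stratification the two results should be nearly identical.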


In [5]:
# --- Define the pipeline ---
# The configured vectorizer goes directly into the pipeline's steps.
# Any vectorizer settings, such as 'stop_words', are set here.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=5000, solver='saga', random_state=42))
])


# --- Train the entire pipeline ---
# Fit the pipeline on the raw text. It vectorizes the text and then
# trains the classifier on the resulting word counts.
print("Training the entire pipeline...")
pipeline.fit(X_train, y_train)
print("✅ Pipeline training complete.")

Training the entire pipeline...
✅ Pipeline training complete.
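
To make the fit-then-transform contract behind the pipeline concrete, the following sketch mimics it in plain Python. It is a deliberately simplified stand-in, not scikit-learn's implementation: `ToyCountVectorizer` is a hypothetical class that splits on whitespace only and ignores stop words.

```python
class ToyCountVectorizer:
    """Simplified bag-of-words counter (illustration only)."""

    def fit(self, documents):
        # Learn the vocabulary from the training documents only.
        words = sorted({word for doc in documents for word in doc.lower().split()})
        self.vocabulary_ = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, documents):
        # Count known words; words unseen during fit are silently dropped.
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for word in doc.lower().split():
                if word in self.vocabulary_:
                    row[self.vocabulary_[word]] += 1
            rows.append(row)
        return rows


vectorizer = ToyCountVectorizer().fit(["love this dress", "poor fit poor fabric"])
train_rows = vectorizer.transform(["love this dress"])
new_rows = vectorizer.transform(["love the fabric"])  # "the" was never seen
print(vectorizer.vocabulary_)
print(train_rows, new_rows)
```

Because the vocabulary learned at fit time must be reused unchanged at prediction time, bundling the vectorizer and classifier in one `Pipeline` avoids having to ship and version them separately.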


In [6]:
# --- EVALUATE THE PIPELINE (The Quality Gate) ---
# Use the trained pipeline to make predictions on the unseen test data.
# The pipeline automatically handles the .transform() step for X_test.
print("\nEvaluating the pipeline on the test data...")
predictions = pipeline.predict(X_test)

# Calculate the performance. You can use any metric, like accuracy.
score = accuracy_score(y_test, predictions)
print(f"Model Accuracy on the test set: {score:.4f}")


Evaluating the pipeline on the test data...
Model Accuracy on the test set: 0.9299
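
Accuracy is a single number, so it can hide how the model does on each class. As a hedged illustration of an additional check, the sketch below computes precision and recall for one label from plain Python lists. `precision_recall` is a hypothetical helper, and the toy labels are invented for the example; they are not results from this notebook.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for `label`, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    predicted = sum(1 for _, p in pairs if p == label)
    actual = sum(1 for t, _ in pairs if t == label)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Invented toy data using the notebook's label scheme (1 positive, -1 negative).
toy_true = [1, 1, 1, 1, -1, -1]
toy_pred = [1, 1, 1, 1, 1, -1]

neg_precision, neg_recall = precision_recall(toy_true, toy_pred, label=-1)
print(neg_precision, neg_recall)
```

In this toy case five of six predictions are correct, yet only half of the negative reviews are caught. On the real test split, `sklearn.metrics.classification_report(y_test, predictions)` reports the same quantities for every class.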


In [7]:
# 3. SAVE THE SINGLE, COMPLETE PIPELINE ARTIFACT
# This object now contains your CountVectorizer
# AND your trained LogisticRegression model.
# This is the file you need to deploy. It knows how to handle raw text.
print("Saving the pipeline to model.joblib...")
joblib.dump(pipeline, 'model.joblib')

Saving the pipeline to model.joblib...


['model.joblib']
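
The three deployment sub-steps from the workflow overview can also be scripted. The sketch below is one possible way to do so with the `google-cloud-storage` and `google-cloud-aiplatform` packages; it is not necessarily how the endpoint used in the following cells was created. The bucket name, folder, display name, and machine type are placeholders or assumptions, and `serving_image_uri` should point to a prebuilt scikit-learn prediction container whose version matches the scikit-learn version used for training.

```python
def artifact_dir_uri(bucket_name, folder):
    """Build the gs:// directory URI that will hold model.joblib."""
    return f"gs://{bucket_name}/{folder.strip('/')}"


def upload_and_deploy(project_id, location, bucket_name, folder,
                      serving_image_uri, local_path="model.joblib",
                      display_name="sentiment-pipeline",  # hypothetical name
                      machine_type="n1-standard-2"):      # assumed machine type
    # Imported lazily so this block can be defined without the cloud SDKs.
    from google.cloud import aiplatform, storage

    # 1. Upload the artifact to Cloud Storage. The prebuilt scikit-learn
    #    container looks for a file named model.joblib in the artifact folder.
    bucket = storage.Client(project=project_id).bucket(bucket_name)
    bucket.blob(f"{folder.strip('/')}/model.joblib").upload_from_filename(local_path)

    # 2. Import the artifact into the Vertex AI Model Registry.
    aiplatform.init(project=project_id, location=location)
    model = aiplatform.Model.upload(
        display_name=display_name,
        artifact_uri=artifact_dir_uri(bucket_name, folder),
        serving_container_image_uri=serving_image_uri,
    )

    # 3. Deploy the registered model; deploy() creates an endpoint if none is given.
    return model.deploy(machine_type=machine_type)
```

The returned `Endpoint` object exposes the same `.predict()` method used in the cells that follow, and its ID would play the role of `ENDPOINT_ID`.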

In [8]:
from google.cloud import aiplatform

In [None]:
# --- 1. SET YOUR VARIABLES ---
# Replace these with your actual project details from the Google Cloud console.
PROJECT_ID = "nice-test-470503"
LOCATION = "asia-southeast1"
ENDPOINT_ID = "1762528244441205760" # The ID of an existing endpoint with the model deployed


# --- 2. INITIALIZE THE CLIENT ---
# This sets up the connection to your project.
aiplatform.init(project=PROJECT_ID, location=LOCATION)

# --- 3. CREATE THE ENDPOINT OBJECT ---
# This creates a local Python object that is a remote control for your live endpoint.
endpoint = aiplatform.Endpoint(endpoint_name=ENDPOINT_ID)


In [10]:
# --- 4. THE NEW DATA ---
# This is the new, raw text review we want to classify.
# Note that it must be inside a list.
new_review = ["I recently purchased this dress and I have to say, it exceeded my expectations. The fabric feels soft yet durable, and it has just the right amount of stretch to make it really comfortable for long wear. The stitching and finishing are neat, giving it a very polished look."]
print(f"Sending new review to the model: '{new_review[0]}'")


# --- 5. MAKE THE PREDICTION CALL ---
# The .predict() method sends the data to the live model.
# The 'instances' argument must be a list.
print("\n...Calling the endpoint...")
response = endpoint.predict(instances=new_review)

print(response)

# --- 6. PARSE AND DISPLAY THE RESPONSE ---
print("...Received a response.")

# The prediction result is stored in the .predictions attribute of the response.
# Since we sent one review, we get one result at index [0].
prediction_result = response.predictions[0]

# The model was trained on labels where 1 is 'Positive' and -1 is 'Negative'.
sentiment = "Positive" if prediction_result == 1 else "Negative"

print("----------------------------------------")
print(f"✅ Prediction: The review sentiment is '{sentiment}'")
print("----------------------------------------")

Sending new review to the model: 'I recently purchased this dress and I have to say, it exceeded my expectations. The fabric feels soft yet durable, and it has just the right amount of stretch to make it really comfortable for long wear. The stitching and finishing are neat, giving it a very polished look.'

...Calling the endpoint...
Prediction(predictions=[1], deployed_model_id='4742275014458343424', metadata=None, model_version_id='1', model_resource_name='projects/790592728786/locations/asia-southeast1/models/5690847884996509696', explanations=None)
...Received a response.
----------------------------------------
✅ Prediction: The review sentiment is 'Positive'
----------------------------------------


In [11]:
new_reviews = [
    "I am so disappointed with this purchase, I will be returning it.",
    "The material felt cheap and it was not what I expected.",
    "It's an okay product, not great but not terrible either.",
    "This dress is absolutely beautiful and fits perfectly!",
    "I recently purchased this dress and I have to say, it exceeded my expectations. The fabric feels soft yet durable, and it has just the right amount of stretch to make it really comfortable for long wear. The stitching and finishing are neat, giving it a very polished look."
]

print("\nSending prediction request to the Vertex AI endpoint...")
# Make the prediction call
response = endpoint.predict(instances=new_reviews)

print(response)
print(type(response))

prediction_results = response.predictions
print(prediction_results)


Sending prediction request to the Vertex AI endpoint...
Prediction(predictions=[-1, -1, -1, 1, 1], deployed_model_id='4742275014458343424', metadata=None, model_version_id='1', model_resource_name='projects/790592728786/locations/asia-southeast1/models/5690847884996509696', explanations=None)
<class 'google.cloud.aiplatform.models.Prediction'>
[-1, -1, -1, 1, 1]
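
The raw list of integers is easier to read when paired with the reviews that produced it. Here is a minimal sketch, assuming the label scheme from the dataset preview (1 for positive, -1 for negative); `LABEL_NAMES` and `label_predictions` are hypothetical names introduced for illustration.

```python
LABEL_NAMES = {1: "Positive", -1: "Negative"}  # assumed mapping, from the dataset preview


def label_predictions(reviews, predictions):
    """Pair each review with a readable sentiment label."""
    if len(reviews) != len(predictions):
        raise ValueError("reviews and predictions must have the same length")
    # int() tolerates predictions that come back from JSON as floats such as 1.0.
    return [(review, LABEL_NAMES[int(pred)]) for review, pred in zip(reviews, predictions)]


sample_reviews = ["The material felt cheap.", "This dress is beautiful!"]
sample_predictions = [-1, 1]  # shaped like response.predictions
for review, sentiment in label_predictions(sample_reviews, sample_predictions):
    print(f"{sentiment:8} | {review}")
```

In the notebook this would be called as `label_predictions(new_reviews, response.predictions)`.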
