Copyright 2023 Google LLC.

SPDX-License-Identifier: Apache-2.0

# **Semi-Structured Application Logs with BQML & PaLM**

In this Colab, we'll walk through loading nginx access logs (from [Kaggle](https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs?resource=download&select=access.log)) into BigQuery and analyzing them with PaLM.

Reference: https://cloud.google.com/bigquery/docs/generate-text-tutorial

# **Costs**

In this document, you use the following billable components of Google Cloud:

* BigQuery: You incur costs for the data that you process in BigQuery, including query and storage.
* Vertex AI: You incur costs for calls to the Vertex AI service that's represented by the remote model.
* Cloud Storage: You incur costs for storing the staged access log file (in JSON format) before it is imported into BigQuery.

# **Workshop Format**

This notebook is part of a hands-on workshop at DevFest 2023.

Self Link: https://bit.ly/df23-log-analysis-bqml-palm

For supplementary materials, please refer to the [Presentation](https://www.google.com/).

# Setup

Please log in with the credentials of your GCP project owner (or of an account with enough permissions to execute billable actions).

In [None]:
import os
import sys

if "google.colab" in sys.modules:
    from google.colab import auth as google_auth

    google_auth.authenticate_user()
    print("Authenticated")

Authenticated


Please modify the following parameters to fit your own GCP environment, at least PROJECT_ID and GCS_BUCKET.

In [None]:
PROJECT_ID = 'colab-bqml-palm-demo' # @param {type:"string"}
GCS_BUCKET = 'gs://colab-bqml-palm-demo' # @param {type:"string"}
GCS_FILENAME = 'accesslog.json' # @param {type:"string"}
BQ_DATASET_ID = 'kaggle_eliasdabbas_web_server_access_logs' # @param {type:"string"}
BQ_TABLE_NAME = 'access_log' # @param {type:"string"}
BQ_LOCATION = 'us' # @param {type:"string"}
BQ_MODEL_NAME = 'colabs_llm' # @param {type:"string"}

!gcloud services --project $PROJECT_ID \
  enable bigquery.googleapis.com \
  bigqueryconnection.googleapis.com \
  aiplatform.googleapis.com

Operation "operations/acat.p2-436800949134-a97f82ad-ac90-457f-8ddc-2f2a7e368158" finished successfully.
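A quick sanity check of the parameters can catch typos before the later cells fail with less obvious errors. The sketch below is only an illustration: `check_params` is a hypothetical helper, and its two regular expressions are simplified approximations of the Cloud Storage bucket and BigQuery dataset naming rules, not the official definitions.

```python
import re
from typing import List

# Assumption: simplified approximations of the naming rules, for illustration only.
# The official Cloud Storage and BigQuery documentation is authoritative.
_BUCKET_RE = re.compile(r"^gs://[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")
_DATASET_RE = re.compile(r"^[A-Za-z0-9_]{1,1024}$")


def check_params(gcs_bucket: str, bq_dataset_id: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    if not _BUCKET_RE.match(gcs_bucket):
        problems.append(
            "GCS_BUCKET should look like 'gs://my-bucket' (lowercase letters, digits, '-', '_', '.')"
        )
    if not _DATASET_RE.match(bq_dataset_id):
        problems.append(
            "BQ_DATASET_ID should contain only letters, digits and underscores"
        )
    return problems
```

In the notebook this could be called as `check_params(GCS_BUCKET, BQ_DATASET_ID)` right after the parameter cell, printing any returned messages.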


# Prepare Datasource (1/3)

*You may skip this part if you have already done it.*

* Download the data from https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs/download?datasetVersionNumber=2 (280MB, 3.5GB uncompressed)
* Put the log data from the compressed file above into your GCS bucket as newline-delimited JSON (Approach 1 converts `access.log` to `accesslog.json`)


# Approach 1 - Download from Kaggle

Note: You need to sign up for Kaggle and create your own API token.

In [None]:
# Note: You need to provide your Kaggle API token
# Kaggle -> Account -> Create New Token (then you'll get kaggle.json)

!pip install -q kaggle
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [None]:
!rm -fr web-server-access-logs.zip
!kaggle datasets download -d eliasdabbas/web-server-access-logs
!unzip -o web-server-access-logs.zip access.log
!awk '{ gsub(/"/, "\\\""); print "{\"payload\":\""$0"\"}"; }' access.log > accesslog.json
!rm -fr access.log

Downloading web-server-access-logs.zip to /content
 96% 257M/267M [00:02<00:00, 40.2MB/s]
100% 267M/267M [00:02<00:00, 95.4MB/s]
Archive:  web-server-access-logs.zip
  inflating: access.log              
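The `awk` one-liner above escapes double quotes only. If a log line contained a backslash or a control character, the resulting line would not be valid JSON. As a hedged alternative, the following sketch does the same wrapping in Python and lets `json.dumps` handle all escaping. The helper names `log_line_to_json` and `convert_log_file` are hypothetical, and this version is likely slower than `awk` on a 3.5GB file.

```python
import json


def log_line_to_json(line: str) -> str:
    """Wrap one raw access-log line as a JSON object with a single `payload` key."""
    # json.dumps escapes quotes, backslashes and control characters.
    return json.dumps({"payload": line.rstrip("\n")}, ensure_ascii=False)


def convert_log_file(src_path: str, dst_path: str) -> int:
    """Write newline-delimited JSON to dst_path and return the number of lines converted."""
    count = 0
    # errors="replace" keeps the conversion going if a line is not valid UTF-8.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(log_line_to_json(line) + "\n")
            count += 1
    return count
```

In the notebook, `convert_log_file("access.log", GCS_FILENAME)` would take the place of the `awk` line.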


In [None]:
!gsutil mb -p $PROJECT_ID $GCS_BUCKET
!gsutil -m cp $GCS_FILENAME $GCS_BUCKET
!rm -fr $GCS_FILENAME

Creating gs://colab-bqml-palm-demo/...
ServiceException: 409 A Cloud Storage bucket named 'colab-bqml-palm-demo' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
Copying file://accesslog.json [Content-Type=application/json]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objec

# Approach 2 - Download by yourself

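Browse to the Kaggle dataset above, download and uncompress it, convert `access.log` to newline-delimited JSON as in Approach 1, and manually upload the result to the GCS bucket as `accesslog.json` (the value of `GCS_FILENAME`).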

In [None]:
!gsutil stat $GCS_BUCKET/$GCS_FILENAME

gs://colab-bqml-palm-demo/accesslog.json:
    Creation time:          Fri, 22 Sep 2023 06:51:28 GMT
    Update time:            Fri, 22 Sep 2023 06:51:28 GMT
    Storage class:          STANDARD
    Content-Length:         3730474167
    Content-Type:           application/json
    Hash (crc32c):          SzlPuQ==
    Hash (md5):             G4HPKceGlSUyCExvXmtlhQ==
    ETag:                   COPS0s3QvYEDEAE=
    Generation:             1695365488486755
    Metageneration:         1


# Prepare Datasource (2/3)

*You may skip this part if you have already done it.*

Load the data from the GCS bucket into a BigQuery dataset.

In [None]:
!bq --location=$BQ_LOCATION mk --dataset $PROJECT_ID:$BQ_DATASET_ID
!bq --location=$BQ_LOCATION \
    --project_id=$PROJECT_ID \
    load \
    --replace \
    --source_format=NEWLINE_DELIMITED_JSON \
    --schema='payload:STRING' \
    $PROJECT_ID:$BQ_DATASET_ID"."$BQ_TABLE_NAME \
    $GCS_BUCKET'/'$GCS_FILENAME


BigQuery error in mk operation: Dataset 'colab-bqml-palm-
demo:kaggle_eliasdabbas_web_server_access_logs' already exists.
Waiting on bqjob_r4895523e3cc46f17_0000018abba9e17f_1 ... (15s) Current status: DONE   
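The same load can also be expressed with the BigQuery Python client instead of the `bq` CLI. The sketch below is an assumption-laden equivalent, not the notebook's own method: `gcs_uri` and `load_payload_table` are hypothetical helper names, and the function assumes the dataset already exists and the caller is authenticated.

```python
def gcs_uri(bucket: str, filename: str) -> str:
    """Join a bucket URL (gs://...) and an object name into a single URI."""
    return bucket.rstrip("/") + "/" + filename.lstrip("/")


def load_payload_table(project_id, dataset_id, table_name, bucket, filename, location):
    """Load newline-delimited JSON with one `payload` STRING column; return the row count."""
    # Imported lazily so gcs_uri can be used without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    table_id = f"{project_id}.{dataset_id}.{table_name}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        schema=[bigquery.SchemaField("payload", "STRING")],
        # WRITE_TRUNCATE mirrors the --replace flag of `bq load`.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_uri(
        gcs_uri(bucket, filename), table_id, location=location, job_config=job_config
    )
    job.result()  # Block until the load job finishes (raises on failure).
    return client.get_table(table_id).num_rows
```

With the notebook's parameters, the call would be `load_payload_table(PROJECT_ID, BQ_DATASET_ID, BQ_TABLE_NAME, GCS_BUCKET, GCS_FILENAME, BQ_LOCATION)`; the returned row count gives a simple check that the load worked.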


# Create BigLake (3/3)

*You may skip this part if you have already done it.*

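Create a BigQuery Cloud resource connection (`colab_biglake`) and grant its service account the Vertex AI User role (`roles/aiplatform.user`), so that the remote model can call Vertex AI.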

In [None]:
!bq mk \
  --connection \
  --location=$BQ_LOCATION \
  --project_id=$PROJECT_ID \
  --connection_type=CLOUD_RESOURCE \
  colab_biglake

Connection 436800949134.us.colab_biglake successfully created


In [None]:
!DEBIAN_FRONTEND=noninteractive apt-get -y -qq install --upgrade jq
BQ_CONNECTION_SA_ = !bq show \
  --connection \
  --project_id=$PROJECT_ID \
  --location=$BQ_LOCATION \
  --format=prettyjson \
  colab_biglake | jq -r '.cloudResource.serviceAccountId'
BQ_CONNECTION_SA = BQ_CONNECTION_SA_[0]
!gcloud projects \
  add-iam-policy-binding \
  $PROJECT_ID \
  --member='serviceAccount:'$BQ_CONNECTION_SA \
  --role='roles/aiplatform.user'

Updated IAM policy for project [colab-bqml-palm-demo].
bindings:
- members:
  - serviceAccount:service-436800949134@gcp-sa-aiplatform-cc.iam.gserviceaccount.com
  role: roles/aiplatform.customCodeServiceAgent
- members:
  - serviceAccount:service-436800949134@gcp-sa-aiplatform-vm.iam.gserviceaccount.com
  role: roles/aiplatform.notebookServiceAgent
- members:
  - serviceAccount:service-436800949134@gcp-sa-aiplatform.iam.gserviceaccount.com
  role: roles/aiplatform.serviceAgent
- members:
  - deleted:serviceAccount:bqcx-436800949134-6v0l@gcp-sa-bigquery-condel.iam.gserviceaccount.com?uid=112471142346029776235
  - deleted:serviceAccount:bqcx-436800949134-xbui@gcp-sa-bigquery-condel.iam.gserviceaccount.com?uid=107342139465376552440
  - serviceAccount:bqcx-436800949134-1wug@gcp-sa-bigquery-condel.iam.gserviceaccount.com
  role: roles/aiplatform.user
- members:
  - user:admin@pingda.altostrat.com
  role: roles/billing.projectManager
- members:
  - serviceAccount:service-436800949134@compute

# Create a Model on BigQuery

In [None]:
from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)
sql_query = """
CREATE OR REPLACE MODEL
    `{bq_dataset_id}.{bq_model_name}`
    REMOTE WITH CONNECTION `{project_id}.{bq_location}.colab_biglake`
    OPTIONS (REMOTE_SERVICE_TYPE="CLOUD_AI_LARGE_LANGUAGE_MODEL_V1");
""".format(bq_dataset_id=BQ_DATASET_ID,
           project_id=PROJECT_ID,
           bq_location=BQ_LOCATION,
           bq_model_name=BQ_MODEL_NAME)

job = client.query(sql_query)
job.result()

if job.state == 'DONE':
  # Report the model name from the parameter instead of a hard-coded value
  print("Model Created: %s.%s" % (BQ_DATASET_ID, BQ_MODEL_NAME))


Model Created: kaggle_eliasdabbas_web_server_access_logs.colabs_llm


# Let's run some real queries!

*Note: If you experience permission issues when executing the query, wait a few more minutes until the IAM settings have propagated.*
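The query cell in this section inserts the prompt text into the SQL with `str.format`, so a prompt containing a single quote would break the statement. A hedged alternative is to pass the prompt as a BigQuery named query parameter. The sketch below assumes that `@prompt` is accepted inside the `ML.GENERATE_TEXT` subquery; `build_generate_text_sql` and `run_generate_text` are hypothetical helper names. Dataset, model, and table names are still formatted into the text, because identifiers cannot be supplied as query parameters.

```python
def build_generate_text_sql(dataset_id, model_name, table_name, payload_limit,
                            temperature, max_output_tokens, top_p, top_k):
    """Return the ML.GENERATE_TEXT query with the prompt left as the @prompt parameter."""
    # Numeric settings are coerced first, so only numbers reach the SQL text.
    return """
SELECT *
FROM
  ML.GENERATE_TEXT(
    MODEL `{dataset_id}.{model_name}`,
    (
      SELECT CONCAT(@prompt, ': ', a.payload) AS prompt, *
      FROM `{dataset_id}.{table_name}` AS a
      LIMIT {payload_limit}
    ),
    STRUCT(
      {temperature} AS temperature,
      {max_output_tokens} AS max_output_tokens,
      {top_p} AS top_p,
      {top_k} AS top_k
    )
  );
""".format(
        dataset_id=dataset_id,
        model_name=model_name,
        table_name=table_name,
        payload_limit=int(payload_limit),
        temperature=float(temperature),
        max_output_tokens=int(max_output_tokens),
        top_p=float(top_p),
        top_k=int(top_k),
    )


def run_generate_text(project_id, sql, prompt):
    """Run the query, binding the prompt text to @prompt, and return the result rows."""
    # Imported lazily so the SQL builder can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("prompt", "STRING", prompt)]
    )
    return client.query(sql, job_config=job_config).result()
```

The rows returned by `run_generate_text` can be printed with the same loop the query cell uses, reading `row['prompt']` and `row['ml_generate_text_result']`.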

In [None]:
# @title Ask for client's platform type
prompt = "Desktop or mobile browser?" # @param {type:"string"}
payload_limit = 10 # @param {type:"integer"}
temperature = 0.2 # @param {type:"number"}
max_output_tokens = 650 # @param {type:"integer"}
top_p = 0.2 # @param {type:"number"}
top_k = 15 # @param {type:"integer"}

from google.cloud import bigquery
import json

client = bigquery.Client(project=PROJECT_ID)

sql_query = """
SELECT *
FROM
  ML.GENERATE_TEXT(
    MODEL `{bq_dataset_id}.{bq_model_name}`,
    (
      SELECT CONCAT('{prompt}: ', a.payload) AS prompt,
      *
      FROM `{bq_dataset_id}.{bq_table_name}` as a
      LIMIT {payload_limit}
    ),
    STRUCT(
      {temperature} AS temperature,
      {max_output_tokens} AS max_output_tokens,
      {top_p} AS top_p,
      {top_k} AS top_k
    )
  );
""".format(
    bq_dataset_id=BQ_DATASET_ID,
    bq_table_name=BQ_TABLE_NAME,
    bq_location=BQ_LOCATION,
    bq_model_name=BQ_MODEL_NAME,
    payload_limit=payload_limit,
    prompt=prompt,
    temperature=temperature,
    max_output_tokens=max_output_tokens,
    top_p=top_p,
    top_k=top_k)

job = client.query(sql_query)
results = job.result()

for row in results:
  _result = json.loads(row['ml_generate_text_result'])
  print("""Prompt: {prompt}
Result: {result}
""".format(
    prompt=row['prompt'],
    result=_result['predictions'][0]['content'])
)


Prompt: Desktop or mobile browser?: 10.249.34.197 - - [23/Jan/2019:13:34:17 +0330] "GET /image/64998/productModel/150x150 HTTP/1.1" 200 4997 "https://www.zanbil.ir/filter/p62,stexists" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" "-"
Result:  Desktop

Prompt: Desktop or mobile browser?: 104.156.210.198 - - [25/Jan/2019:02:34:31 +0330] "POST /ajaxFilter/p28609,b136 HTTP/1.1" 200 4121 "https://www.zanbil.ir/filter/p28609" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" "-"
Result:  Desktop

Prompt: Desktop or mobile browser?: 104.194.24.17 - - [25/Jan/2019:10:37:23 +0330] "GET /image/399/brand HTTP/1.1" 200 2302 "https://www.zanbil.ir/browse/wall-oven/%D9%81%D8%B1-%D8%AA%D9%88%DA%A9%D8%A7%D8%B1" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0" "-"
Result:  Desktop

Prompt: Desktop or mobile browser?: 104.194.24.190 - - [22/Jan/2019