<a href="https://colab.research.google.com/github/buaindra/gcp_utility/blob/main/gcp/data_pipeline_poc/BQ_ML_Log_Analysis/log_analysis_gcp_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Log data analysis using BQ-ML

## Sample Log Analysis blogs
1. https://aws.amazon.com/blogs/opensource/introducing-aws-security-analytics-bootstrap/

## Dataset Ref
1. DDOS Network Logs: https://www.kaggle.com/datasets/jacobvs/ddos-attack-network-logs

### How to download data using the Kaggle API
#### ref: https://github.com/Kaggle/kaggle-api
1. Sign in to the Kaggle site.
2. Hover the mouse over your profile photo.
3. Select "Account".
4. Go to the API section and click "Create New API Token".
5. This downloads the kaggle.json file, which contains your Kaggle username and API key.
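The values in `kaggle.json` can also be read programmatically instead of being copied by hand. The following is a minimal sketch, assuming the file holds the `username` and `key` fields; the function name `load_kaggle_credentials` and the default path are illustrative and not part of the Kaggle API.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path="kaggle.json"):
    """Read kaggle.json and export its values as environment variables.

    Assumption: the file contains the "username" and "key" fields.
    The function name and default path are hypothetical.
    """
    creds = json.loads(Path(path).read_text())
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds
```

Calling `load_kaggle_credentials("/content/kaggle.json")` (a hypothetical upload path) in the notebook process makes the credentials visible to later cells without hard-coding them.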

In [None]:
!pip install kaggle --upgrade

In [None]:
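# "!export" runs in a throwaway subshell, so the variables would not persist; %env sets them for the notebook session.
%env KAGGLE_USERNAME=buaindra
%env KAGGLE_KEY=15b6f914804e6beef1e16662d53de05a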

# Google Client Approach

## Google Cloud Storage
1. Google Storage Doc: https://cloud.google.com/storage/docs/gsutil/commands/mb


## BigQuery Machine Learning
1. Google BQ ML Doc: https://cloud.google.com/bigquery-ml/docs/introduction
2. https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey
3. QWIKLABS: https://partner.cloudskillsboost.google/focuses/14821?parent=catalog

## Machine Learning
1. Google ML Doc: https://developers.google.com/machine-learning/glossary/#model
2. Youtube Linear vs Logistic Regression: https://www.youtube.com/watch?v=OCwZyYH14uw

## Dataset Shared by Christian
1. MAWILab: http://www.fukuda-lab.org/mawilab/index.html

## BQ-ML Ref:
1. https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-detect-anomalies#mldetect_anomalies_function

### Default parameters in Cloud Shell

In [None]:
export PROJECT_ID="$(gcloud config get-value project)"
export LOCATION="us-central1"
export BQ_DATASET="test_dataset"
export BQ_TABLE="test_table"
export BUCKET="${PROJECT_ID}_bucket"

### Enable the required Google APIs

In [None]:
gcloud services enable storage.googleapis.com
gcloud services enable bigquery.googleapis.com

### Create GCS Bucket

In [None]:
gsutil mb -c standard -l ${LOCATION} gs://${BUCKET}

### Create BigQuery Dataset

In [None]:
bq --location=${LOCATION} mk \
--dataset \
--default_table_expiration 3600 \
${PROJECT_ID}:${BQ_DATASET}

#### Source data
1. Upload the downloaded log data to the GCS bucket.
2. Create a local **schema_json.json** file containing the schema JSON below. **The schema file must be stored locally**.

In [None]:
[
  {
    "name": "anomalyID",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "a unique anomaly identifier. Several lines in the CSV file can describe different sets of packets that belong to the same anomaly. The anomalyID field permits to identify lines that refer to the same anomaly."
  },
  {
    "name": "srcIP",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the source IP address of the identified anomalous traffic (optional)."
  },
  {
    "name": "srcPort",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the source port of the identified anomalous traffic (optional)."
  },
  {
    "name": "dstIP",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the destination IP address of the identified anomalous traffic (optional)."
  },
  {
    "name": "dstPort",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the destination port of the identified anomalous traffic (optional)."
  },
  {
    "name": "taxonomy",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the category assigned to the anomaly using the taxonomy for backbone traffic anomalies."
  },
  {
    "name": "heuristic",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the code assigned to the anomaly using simple heuristic based on port number, TCP flags and ICMP code."
  },
  {
    "name": "distance",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the difference Dn-Da, see XML Schema (admd)."
  },
  {
    "name": "nbDetectors",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the number of configurations (detector and parameter tuning) that reported the anomaly."
  },
  {
    "name": "label",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "is the MAWILab label assigned to the anomaly, it can be either: anomalous, suspicious, or notice."
  }
]
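Because every field in the schema above is a nullable `STRING`, the file can also be generated rather than typed by hand. This is a minimal sketch under that assumption; the helper names `build_schema` and `write_schema` are illustrative, and the descriptions are left out for brevity since BigQuery treats `description` as optional.

```python
import json

# Field names taken from the schema above, in column order.
FIELD_NAMES = [
    "anomalyID", "srcIP", "srcPort", "dstIP", "dstPort",
    "taxonomy", "heuristic", "distance", "nbDetectors", "label",
]


def build_schema(field_names):
    """Return a BigQuery-style schema list with all fields as nullable STRING."""
    return [
        {"name": name, "type": "STRING", "mode": "NULLABLE"}
        for name in field_names
    ]


def write_schema(path="schema_json.json"):
    """Write the schema to a local JSON file and return it."""
    schema = build_schema(FIELD_NAMES)
    with open(path, "w") as fh:
        json.dump(schema, fh, indent=2)
    return schema
```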

### Load data from GCS to BQ (ELT)

In [None]:
export FILE_NAME01="20210528_anomalous_suspicious.csv"
export BQ_TABLE01="20210528_anomalous_suspicious"

bq --location=${LOCATION} load \
--autodetect \
--skip_leading_rows=1 \
--source_format=CSV \
${PROJECT_ID}:${BQ_DATASET}.${BQ_TABLE01} \
gs://${BUCKET}/${FILE_NAME01}


In [None]:
export FILE_NAME02="20210528_notice.csv"
export BQ_TABLE02="20210528_notice"


bq --location=${LOCATION} load \
--autodetect \
--skip_leading_rows=1 \
--source_format=CSV \
${PROJECT_ID}:${BQ_DATASET}.${BQ_TABLE02} \
gs://${BUCKET}/${FILE_NAME02}

### Create training and testing tables
1. Pre-process the data first.
2. Then combine the two tables and split the result 80%/20% into training and testing datasets.

In [None]:
create or replace table `test_dataset.ml_dataset`(
  id INT64,
  anomalyID INT64,
  srcIP STRING, 
  srcPort STRING, 
  dstIP STRING, 
  dstPort STRING, 
  taxonomy STRING, 
  heuristic STRING, 
  distance STRING, 
  nbDetectors STRING, 
  label STRING
);

# Number the rows after the union so that every id is unique across both source tables.
insert into `test_dataset.ml_dataset`
select row_number() over() as id,
  anomalyID,
  IFNULL(srcIP, '0') as srcIP, 
  IFNULL(srcPort, '0') as srcPort, 
  IFNULL(dstIP, '0') as dstIP, 
  IFNULL(dstPort, '0') as dstPort, 
  IFNULL(taxonomy, '0') as taxonomy, 
  IFNULL(heuristic, '0') as heuristic, 
  IFNULL(distance, '0') as distance, 
  IFNULL(nbDetectors, '0') as nbDetectors, 
  IFNULL(label, '0') as label
from
	(select anomalyID, 
	cast(_srcIP as string) as srcIP, 
	cast(_srcPort as string) as srcPort,
	cast(_dstIP as string) as dstIP, 
	cast(_dstPort as string) as dstPort, 
	cast(_taxonomy as string) as taxonomy,
	'' as heuristic,
	cast(_heuristic as string) as distance, 
	cast(_distance as string) as nbDetectors, 
	cast(_nbDetectors as string) as label 
	# cast(_label as string) as label
	from `test_dataset.20210528_anomalous_suspicious`
	 
	union all 

	select row_number() over() as anomalyID,
	cast(anomalyID as string) as srcIP,
	cast(_srcIP as string) as srcPort, 
	cast(_srcPort as string) as dstIP,
	cast(_dstIP as string) as dstPort, 
	cast(_dstPort as string) as taxonomy, 
	cast(_taxonomy as string) as heuristic, 
	cast(_heuristic as string) as distance, 
	cast(_distance as string) as nbDetectors, 
	cast(_nbDetectors as string) as label
	# cast(_label as string) as label 
	from `test_dataset.20210528_notice`
	);

create or replace table `test_dataset.training_dataset`(
  id INT64,
  anomalyID INT64,
  srcIP STRING, 
  srcPort STRING, 
  dstIP STRING, 
  dstPort STRING, 
  taxonomy STRING, 
  heuristic STRING, 
  distance STRING, 
  nbDetectors STRING, 
  label STRING
);

# rand() < 0.8 keeps roughly 80% of the rows.
insert into `test_dataset.training_dataset` 
select * 
from `test_dataset.ml_dataset`
where rand() < 0.8;

create or replace table `test_dataset.testing_dataset`(
  id INT64,
  anomalyID INT64,
  srcIP STRING, 
  srcPort STRING, 
  dstIP STRING, 
  dstPort STRING, 
  taxonomy STRING, 
  heuristic STRING, 
  distance STRING, 
  nbDetectors STRING, 
  label STRING
);

insert into `test_dataset.testing_dataset` 
select * 
from `test_dataset.ml_dataset` as t1 
where not exists(select 1 from `test_dataset.training_dataset` as t2 where t1.id = t2.id)
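The split above depends on `rand()`, so it changes on every run. One alternative is to derive the split from a hash of the row id, which gives the same assignment each time. The sketch below shows the idea in plain Python with `hashlib`; it is an illustration of the technique, not the method used in this notebook, and the names `assign_split` and `split_ids` are hypothetical. In BigQuery the same idea could be expressed with a hashing function such as `FARM_FINGERPRINT`.

```python
import hashlib


def assign_split(row_id, train_percent=80):
    """Map a row id to 'train' or 'test' using a stable hash bucket (0-99)."""
    digest = hashlib.md5(str(row_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "train" if bucket < train_percent else "test"


def split_ids(ids, train_percent=80):
    """Partition ids into two disjoint lists; the result is repeatable."""
    train, test = [], []
    for row_id in ids:
        if assign_split(row_id, train_percent) == "train":
            train.append(row_id)
        else:
            test.append(row_id)
    return train, test
```

Because each id lands in exactly one bucket, the two lists never overlap and together cover every row, provided the ids themselves are unique.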



In [None]:
select count(*) from `test_dataset.ml_dataset`; # 482

select count(*) from `test_dataset.training_dataset`;  # 387 in the recorded run; varies because the split uses rand()

select count(*) from `test_dataset.testing_dataset`; # 19 in the recorded run

### Create Model in BigQuery from training data

In [None]:
CREATE OR REPLACE MODEL `test_dataset.test_model_logistic_reg`
OPTIONS(model_type='LOGISTIC_REG', INPUT_LABEL_COLS = ['label']) AS
	select anomalyID, 
	srcIP, 
	srcPort, 
	dstIP, 
	dstPort, 
	taxonomy, 
	heuristic, 
	distance, 
	nbDetectors, 
	label 
	from `test_dataset.training_dataset`

Once the query finishes, the trained model is available in the dataset and can be queried. Use the ML.EVALUATE function to view its evaluation metrics:

In [None]:
SELECT
  *
FROM
  ML.EVALUATE(MODEL `test_dataset.test_model_logistic_reg`)
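For a logistic regression model, ML.EVALUATE reports metrics such as precision, recall and accuracy. The sketch below shows what those three numbers mean for a multi-class label, using plain Python on two lists of labels. It only illustrates the definitions and is not how BigQuery ML computes or aggregates them; the function name `evaluate_labels` is hypothetical.

```python
def evaluate_labels(actual, predicted):
    """Compute overall accuracy and per-class precision/recall."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")

    pairs = list(zip(actual, predicted))
    accuracy = sum(a == p for a, p in pairs) / len(pairs)

    per_class = {}
    for label in sorted(set(actual) | set(predicted)):
        tp = sum(a == label and p == label for a, p in pairs)  # true positives
        fp = sum(a != label and p == label for a, p in pairs)  # false positives
        fn = sum(a == label and p != label for a, p in pairs)  # false negatives
        per_class[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return {"accuracy": accuracy, "per_class": per_class}
```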

In [None]:
SELECT
  *
FROM
  ML.PREDICT(MODEL `test_dataset.test_model_logistic_reg`,
    (
    SELECT
      * except(id, label)
    FROM
      `test_dataset.testing_dataset`))
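The same prediction query can be run from Python with the `google-cloud-bigquery` client library. This is a minimal sketch, assuming the library is installed and default credentials and a default project are configured; the helper names `build_predict_query` and `run_predict` are illustrative. For a logistic regression model the result should include a `predicted_label` column alongside the input columns.

```python
def build_predict_query(dataset, model, table):
    """Build the ML.PREDICT statement used above for the given names."""
    return (
        f"SELECT * FROM ML.PREDICT(MODEL `{dataset}.{model}`, "
        f"(SELECT * EXCEPT(id, label) FROM `{dataset}.{table}`))"
    )


def run_predict(dataset="test_dataset",
                model="test_model_logistic_reg",
                table="testing_dataset"):
    """Run the prediction query and return the rows as dictionaries."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default project and credentials
    rows = client.query(build_predict_query(dataset, model, table)).result()
    return [dict(row.items()) for row in rows]
```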