# 01 - BigQuery - Table Data Source
Use BigQuery to load and prepare data for machine learning.


---
## Source Data
`genai-demo-2024.ml_datasets.ulb_fraud_detection`

FraudFlix Technologies is a cutting-edge company focused on making financial transactions safer. Using machine learning, FraudFlix analyzes huge amounts of transaction data to spot and stop fraud as it happens. Born from a hackathon challenge, the company uses a special dataset of European credit card transactions to train its algorithms. What sets FraudFlix apart is its approach to continuously testing and improving its fraud detection models by simulating real-world transactions. This innovative strategy is a game-changer in the fight against digital fraud, offering both businesses and consumers a higher level of security. For data engineers and scientists, FraudFlix represents an exciting frontier where AI meets financial safety, showcasing practical applications of their skills to solve real-world problems.


---
## Setup

inputs:

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'genai-demo-2024'

In [12]:
REGION = 'us-central1'
EXPERIMENT = '01'
SERIES = '01'

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'ml_datasets'
BQ_TABLE = 'ulb_fraud_detection'

# Data source for this series of notebooks: Described above
BQ_SOURCE = 'genai-demo-2024.ml_datasets.ulb_fraud_detection'



packages:

In [3]:
from google.cloud import bigquery
from google.cloud import storage

clients:

In [4]:
bq = bigquery.Client(project = PROJECT_ID)
gcs = storage.Client(project = PROJECT_ID)

parameters:

In [5]:
BUCKET = PROJECT_ID

### Retrieve and Review a Sample From The Table:
> **Note:** `LIMIT 5` limits the number of rows BigQuery returns to 5, but BigQuery still performs a full table scan. If the table is larger than 1 GB and you want to limit the rows scanned for a quick review like this one, replacing `LIMIT 5` with `TABLESAMPLE SYSTEM (1 PERCENT)` is more efficient. For tables under 1 GB, the sample still returns the full table. More on [Table Sampling](https://cloud.google.com/bigquery/docs/table-sampling)

In [13]:
query = f"""
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` TABLESAMPLE SYSTEM (1 PERCENT)
#LIMIT 5
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V22,V23,V24,V25,V26,V27,V28,Amount,Class,Feedback
0,171014.0,-1.131189,1.811996,-1.122398,-0.460932,2.092069,-1.354922,1.669337,-0.222706,-1.181626,...,-0.076353,-0.616869,0.435200,1.269360,0.773910,-0.464797,-0.210717,0.760000,0,very satisfied.
1,35516.0,1.192986,0.450860,-0.481564,0.674517,0.238480,-0.632110,0.168169,0.007867,-0.316793,...,-0.172393,-0.053980,-0.087989,0.445126,0.383433,-0.025874,0.021537,0.760000,0,very satisfied.
2,42514.0,1.312386,0.653368,-0.772995,0.755198,0.366244,-1.169847,0.409476,-0.294106,-0.153917,...,-0.466311,-0.146696,-0.222764,0.641990,0.399379,-0.030434,0.038114,0.760000,0,very satisfied.
3,50858.0,-0.523831,0.845053,2.798503,2.187664,-0.589713,-0.206753,-0.007914,0.240690,-0.479942,...,0.619409,-0.125363,0.895330,-0.175804,0.035546,0.083254,0.079792,0.760000,0,very satisfied.
4,149517.0,-1.570263,0.471418,-1.250386,-0.460958,3.555698,-1.070799,0.958921,-0.093190,-1.141216,...,-0.236868,-0.662132,-0.517025,1.326950,0.853136,-0.285010,0.288440,0.760000,0,very satisfied.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66672,46286.0,-0.946968,1.452352,1.494954,0.095164,-0.572162,-1.184136,0.381151,0.243009,-0.654278,...,-0.447785,0.162138,0.923492,-0.572078,-0.081837,0.092660,0.147710,1.870000,0,Very happy with the quick and efficient service.
66673,38211.0,0.390006,-0.820230,0.664101,2.702281,-0.834142,0.142078,0.164984,0.052547,-0.731222,...,0.207523,-0.383898,0.337729,0.236839,0.011670,-0.063936,0.085387,382.970001,0,Very happy with the quick and efficient service.
66674,168097.0,1.808470,-0.652284,-0.451944,0.405496,-0.797639,-0.746692,-0.297487,-0.140971,1.010486,...,-0.713694,0.342933,-0.029395,-0.675873,0.223072,-0.058672,-0.025937,111.930000,0,Very happy with the quick and efficient service.
66675,66228.0,0.988617,-0.412961,0.257419,-0.088862,-0.588306,-0.496338,-0.021778,0.027533,0.008506,...,-0.506842,0.064342,0.254996,-0.020271,0.837375,-0.098967,0.010280,111.930000,0,Very happy with the quick and efficient service.
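
One way to check the note above on `LIMIT` versus `TABLESAMPLE` is to run both variants and compare the bytes each job reports. The sketch below is an illustration, not part of the original notebook: `sample_queries` and `bytes_processed` are hypothetical helper names, and `bytes_processed` assumes an authenticated `bigquery.Client` such as `bq`. It runs each query with the cache disabled, so the usual query cost applies.

```python
def sample_queries(table: str, percent: int = 1, limit: int = 5) -> dict:
    """Build the two quick-review variants discussed in the note (hypothetical helper)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return {
        "limit": f"SELECT * FROM `{table}` LIMIT {limit}",
        "tablesample": f"SELECT * FROM `{table}` TABLESAMPLE SYSTEM ({percent} PERCENT)",
    }


def bytes_processed(client, sql: str) -> int:
    """Run `sql` with the cache disabled and return the bytes the job processed."""
    # Imported lazily so sample_queries() can be used without the client library.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(use_query_cache=False)
    job = client.query(sql, job_config=config)
    job.result()  # wait for completion so the job statistics are populated
    return job.total_bytes_processed


queries = sample_queries("genai-demo-2024.ml_datasets.ulb_fraud_detection")
print(queries["tablesample"])
# With a client available, something like:
#   {name: bytes_processed(bq, sql) for name, sql in queries.items()}
```

On a table this small, both numbers are expected to be similar, which is consistent with the note that sampling has no effect below 1 GB.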


### Check out this table in BigQuery Console:
- Click: https://console.cloud.google.com/bigquery
- Make sure the selected project is the one used in this notebook
- Under Explore, expand this project and review the dataset and table

In [11]:
print(f"Direct Link To This Project In BigQuery:\nhttps://console.cloud.google.com/bigquery?project={PROJECT_ID}")

Direct Link To This Project In BigQuery:
https://console.cloud.google.com/bigquery?project=genai-demo-2024


---
## Review Data in BigQuery
Additional SQL queries could be used to review the data. This section instead pulls a column of the table into a pandas DataFrame for local review in Python:

> **Note:** This query selects only one column, so BigQuery scans less data because it does not process the other columns.

In [14]:
query = f"""
SELECT Class
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
"""
df = bq.query(query = query).to_dataframe()

In [15]:
df['Class'].value_counts()

Class
0    276740
1       257
Name: count, dtype: Int64

In [16]:
df['Class'].value_counts(normalize=True)

Class
0    0.999072
1    0.000928
Name: proportion, dtype: Float64
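
The same counts can also be computed inside BigQuery, so only one row per class comes back instead of the whole `Class` column. This is a minimal sketch, assuming a hypothetical helper named `class_balance_query`; the proportion expression reuses the `SUM(COUNT(*)) OVER()` pattern that appears in the split review query in this notebook. The arithmetic at the end only re-derives the proportions from the counts printed above.

```python
def class_balance_query(table: str, label: str = "Class") -> str:
    """Return SQL that counts rows per label value and their share of the table."""
    return f"""
SELECT {label},
       COUNT(*) AS n,
       COUNT(*) / SUM(COUNT(*)) OVER () AS proportion
FROM `{table}`
GROUP BY {label}
ORDER BY {label}
"""


# Re-derive the proportions from the value_counts() output shown above.
counts = {0: 276740, 1: 257}
total = sum(counts.values())
proportions = {label: round(n / total, 6) for label, n in counts.items()}
print(total)        # 276997
print(proportions)  # {0: 0.999072, 1: 0.000928}
```

With the notebook's client, the query could be run as `bq.query(class_balance_query(f"{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}")).to_dataframe()`.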

---
## Prepare Data for Analysis

Create a prepped version of the data that adds a unique `transaction_id` and a `splits` column with TRAIN/VALIDATE/TEST assignments (roughly 80/10/10) using SQL DDL:

In [17]:
query = f"""
CREATE TABLE IF NOT EXISTS `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_prepped` AS
WITH add_id AS(SELECT *, GENERATE_UUID() transaction_id FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`)
SELECT *,
    CASE 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 8 THEN "TRAIN" 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 9 THEN "VALIDATE"
        ELSE "TEST"
    END AS splits
FROM add_id
"""
job = bq.query(query = query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f8582f67fa0>

In [18]:
(job.ended-job.started).total_seconds()

5.948

In [19]:
if job.estimated_bytes_processed:
    print(f'{job.estimated_bytes_processed/1000000} MB')

77.532674 MB
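
The `CASE` expression above hashes each `transaction_id`, takes the result modulo 10, and maps buckets 0 to 7 to TRAIN, bucket 8 to VALIDATE, and bucket 9 to TEST. The sketch below shows the same bucketing idea in plain Python. It is an illustration under stated assumptions: SHA-256 stands in for `FARM_FINGERPRINT`, so it will not reproduce BigQuery's assignments, and `assign_split` and `split_counts` are hypothetical names.

```python
import hashlib
from collections import Counter


def assign_split(transaction_id: str) -> str:
    """Map an id to TRAIN/VALIDATE/TEST using a hash bucket in 0..9."""
    # Stand-in hash: the first 8 bytes of SHA-256 (NOT FARM_FINGERPRINT).
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10
    if bucket < 8:
        return "TRAIN"
    if bucket < 9:
        return "VALIDATE"
    return "TEST"


def split_counts(ids) -> Counter:
    """Count how many ids land in each split."""
    return Counter(assign_split(i) for i in ids)


# The same id always gets the same split; fractions approach 80/10/10.
ids = [f"id-{i}" for i in range(10_000)]
counts = split_counts(ids)
print({name: counts[name] / len(ids) for name in ("TRAIN", "VALIDATE", "TEST")})
```

Because the hash is deterministic, a given `transaction_id` always maps to the same split. The ids themselves come from `GENERATE_UUID()`, which produces new values on every run, so the assignments stay stable only as long as the prepped table is kept; `CREATE TABLE IF NOT EXISTS` leaves an existing table untouched.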


Review the test/train split:

In [20]:
query = f"""
SELECT splits, count(*) as Count, 100*count(*) / (sum(count(*)) OVER()) as Percentage
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_prepped`
GROUP BY splits
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,splits,Count,Percentage
0,TEST,27694,9.997942
1,TRAIN,221758,80.057907
2,VALIDATE,27545,9.944151
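
With only 257 fraud rows in the table, it can also be worth confirming that every split received some of them. The following is a sketch, not part of the original notebook: `fraud_by_split_query` and `splits_without_fraud` are hypothetical helpers, and the example rows are made-up values used only to exercise the check, not query results.

```python
def fraud_by_split_query(table: str) -> str:
    """Return SQL that counts all rows and fraud rows (Class = 1) per split."""
    return f"""
SELECT splits,
       COUNT(*) AS total,
       COUNTIF(Class = 1) AS fraud
FROM `{table}`
GROUP BY splits
ORDER BY splits
"""


def splits_without_fraud(rows) -> list:
    """Return the names of splits whose fraud count is zero."""
    return sorted(row["splits"] for row in rows if row["fraud"] == 0)


# Illustrative rows only (made-up counts, not results from BigQuery).
example_rows = [
    {"splits": "TRAIN", "total": 80, "fraud": 3},
    {"splits": "VALIDATE", "total": 10, "fraud": 0},
    {"splits": "TEST", "total": 10, "fraud": 1},
]
print(splits_without_fraud(example_rows))  # ['VALIDATE']
```

Rows returned by `bq.query(...).result()` support the same `row["column"]` access, so the check could be applied to them directly.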


Retrieve a subset of the data to a Pandas dataframe:

In [21]:
query = f"""
SELECT * 
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_prepped`
LIMIT 5
"""
data = bq.query(query = query).to_dataframe()

In [22]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V24,V25,V26,V27,V28,Amount,Class,Feedback,transaction_id,splits
0,135426.0,2.132591,-0.07907,-2.347959,0.060631,0.866875,-0.526201,0.438905,-0.192546,0.1603,...,0.133487,0.401384,0.70424,-0.132129,-0.098,0.76,0,very satisfied.,e5f648f8-d5e7-4ceb-86ef-cb7c6dc7d15a,TEST
1,97430.0,0.168853,0.136885,1.892742,-0.242915,-0.184315,-0.024114,0.012281,-0.310277,2.11054,...,0.041079,-1.382377,0.295097,-0.337653,-0.351998,14.95,0,very satisfied.,b786d86a-2a8b-4b3b-879e-650c221acbc5,TEST
2,40864.0,0.491464,-0.946247,-0.345005,1.495592,-0.01986,0.454469,0.629021,-0.123851,0.138186,...,-0.709041,0.672462,-0.304324,-0.025661,0.07582,387.790009,0,very satisfied.,0c1da076-6309-487d-86e6-995a641121ba,TEST
3,30606.0,-0.462273,-0.355619,2.048961,-1.706096,-1.08568,-0.303081,-0.704979,-0.057419,-2.31611,...,-0.109457,0.220757,0.050688,0.006466,0.062116,18.4,0,very satisfied.,c7009b0e-3be6-4b25-9374-05aa97ca0981,TEST
4,90355.0,-0.191694,1.227645,0.367228,-0.44199,0.581749,-1.091603,0.969535,-0.350792,1.23512,...,0.011476,-0.419318,0.099607,0.338829,0.153138,0.89,0,very satisfied.,727fcf48-4b2a-40bf-bfe8-aec6a9ebdf1d,TEST
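
All five rows above happen to come from the TEST split, since `LIMIT` without `ORDER BY` returns an arbitrary subset. To pull rows from one specific split, a query parameter can be used. This is a minimal sketch, assuming hypothetical helpers `split_query` and `load_split` and an authenticated `bigquery.Client` such as `bq`; the table name is passed in by the caller.

```python
from typing import Optional

VALID_SPLITS = ("TRAIN", "VALIDATE", "TEST")


def split_query(table: str, limit: Optional[int] = None) -> str:
    """Return SQL selecting one split; the split name is bound as @split."""
    sql = f"SELECT * EXCEPT(splits) FROM `{table}` WHERE splits = @split"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


def load_split(client, table: str, split: str, limit: Optional[int] = None):
    """Load one split of the prepped table into a pandas DataFrame."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}")
    # Imported lazily so split_query() can be used without the client library.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("split", "STRING", split)]
    )
    return client.query(split_query(table, limit), job_config=config).to_dataframe()


print(split_query("genai-demo-2024.ml_datasets.ulb_fraud_detection_prepped", limit=5))
```

For example, `load_split(bq, f"{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_prepped", "TRAIN", limit=1000)` would return up to 1000 training rows.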
