In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get started with BigQuery DataFrames

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.10

This notebook is adapted from the following doc: [Getting Started with BigFrame](https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb)

## Overview

Use this notebook to get started with BigQuery DataFrames, including setup, installation, and basic tutorials.

BigQuery DataFrames provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine.

* `bigframes.pandas` provides a pandas-like API for analytics.
* `bigframes.ml` provides a scikit-learn-like API for ML.

Learn more about [BigQuery DataFrames](https://cloud.google.com/python/docs/reference/bigframes/latest).

### Objective

In this tutorial, you learn how to install BigQuery DataFrames, load data into a BigQuery DataFrames DataFrame, and inspect and manipulate the data using pandas and a custom Python function, running at BigQuery scale.

The steps include:

- Creating a BigQuery DataFrames DataFrame: Access data from a local CSV to create a BigQuery DataFrames DataFrame.
- Inspecting and manipulating data: Use pandas to perform data cleaning and preparation on the DataFrame.
- Deploying a custom function: Deploy a [remote function](https://cloud.google.com/bigquery/docs/remote-functions) that runs a scalar Python function at BigQuery scale.

### Dataset

This tutorial uses the [`penguins` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex.

The same dataset is also stored in a public Cloud Storage bucket as a CSV file so that you can use it to try ingesting data from a local environment.

## Installation

Install the following package, which is required to run this notebook:

In [1]:
!pip install bigframes
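
To confirm that the install succeeded before continuing, you could check the installed version from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration and is not part of BigQuery DataFrames.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the bigframes version if the install succeeded, otherwise None.
print("bigframes version:", installed_version("bigframes"))
```

If this prints `None` right after installing, restarting the kernel (see the next section) is a reasonable first thing to try.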



### Run this only if you use Google Colab

Uncomment and run the following cell to restart the kernel:

In [9]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

Complete the tasks in this section to set up your environment.

#### Set your project ID


In [2]:
# Enter your project ID

PROJECT_ID = "bqstackdemo"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

Updated property [core/project].
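
A typo in the project ID tends to surface only later as a confusing permission or not-found error, so a quick format check can help. The sketch below assumes the documented format for newly created project IDs (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; does not end with a hyphen). The helper name is hypothetical, and older domain-scoped IDs would not pass this check.

```python
import re

# Assumed format for newly created project IDs: 6 to 30 characters,
# lowercase letters, digits, and hyphens; starts with a letter and
# does not end with a hyphen.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_project_id(value: str) -> bool:
    """Return True if `value` matches the assumed project ID format."""
    return _PROJECT_ID_RE.fullmatch(value) is not None


print(looks_like_project_id("bqstackdemo"))
```

This only checks the shape of the string. It does not confirm that the project exists or that you have access to it.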


#### Set the region

You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations).

In [3]:
REGION = "US"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

**Vertex AI Workbench**

Do nothing; you are already authenticated.

**Local JupyterLab instance**

Uncomment and run the following cell:

In [4]:
# ! gcloud auth login

**Colab**

Uncomment and run the following cell:

In [5]:
# from google.colab import auth
# auth.authenticate_user()

### Import libraries

In [6]:
import bigframes.pandas as bf


### Set BigQuery DataFrames options

In [7]:
bf.options.bigquery.project = PROJECT_ID
bf.options.bigquery.location = REGION

If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bf.close_session()`. After that, you can reuse `bf.options.bigquery.location` to specify another location.
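
The reset described above can be sketched as a small helper. The function name `switch_location` and the `"EU"` value are illustrative assumptions; the helper takes the imported `bigframes.pandas` module (`bf` in this notebook) as an argument so that it does not depend on a live session to be defined.

```python
def switch_location(bpd, new_location: str) -> None:
    """Close the current session, then point new sessions at `new_location`.

    `bpd` is the imported `bigframes.pandas` module (`bf` in this notebook).
    """
    # Close the session first; the location is then set for the next session.
    bpd.close_session()
    bpd.options.bigquery.location = new_location


# Example (not executed here): switch_location(bf, "EU")
```

DataFrame and Series objects created before the reset belong to the closed session, so plan to recreate them afterwards.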

## See the power of BigQuery DataFrames first-hand

BigQuery DataFrames enables you to interact with datasets of any size, so that you can explore, transform, and understand even your biggest datasets using familiar tools like pandas and scikit-learn.

For example, take the BigQuery sample table `bigquery-samples.wikipedia_pageviews.200809h`, which is ~60 GB in size. This is not a dataset you'd likely be able to process in pandas without extra infrastructure.

With BigQuery DataFrames, however, computation is handled by BigQuery's highly scalable compute engine, meaning you can focus on doing data science without hitting size limitations.

If you'd like to try creating a BigQuery DataFrames DataFrame from this table, run the next cell to load the table using the `read_gbq` method.



In [8]:
bq_df_sample = bf.read_gbq("bigquery-samples.wikipedia_pageviews.200809h")

No problem! BigQuery DataFrames makes a DataFrame, `bq_df_sample`, containing the entirety of the source table of data.

Run the following cell to see pandas in action over your new BigQuery DataFrames DataFrame.

This code uses regex to filter the DataFrame to include only rows with Wikipedia page titles containing the word "Google", sums the total views by page title, and then returns the top 100 results.

In [9]:
bq_df_sample[bq_df_sample.title.str.contains(r"[Gg]oogle")]\
 .groupby(['title'], as_index=False)['views'].sum(numeric_only=True)\
 .sort_values('views', ascending=False)\
 .head(100)

Unnamed: 0,title,views
21911,Google,1414560
27669,Google_Chrome,962482
28394,Google_Earth,383566
29184,Google_Maps,205089
27251,Google_Android,99450
33900,Google_search,97665
31825,Google_chrome,78399
30204,Google_Street_View,71580
40798,Image:Google_Chrome.png,60746
35222,Googleplex,53848
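
To make the logic of that pipeline concrete, here is the same filter, group, sum, and sort sequence written in plain Python over a handful of made-up rows. The data is invented for illustration and is not taken from the real table; the sketch also assumes that `str.contains` behaves like a regex search anywhere in the string, as it does in pandas.

```python
import re
from collections import Counter

# Made-up toy rows of (title, views); not taken from the real table.
rows = [
    ("Google", 5),
    ("Google_Maps", 3),
    ("google_search", 2),
    ("Google", 4),
    ("Python", 10),
]

pattern = re.compile(r"[Gg]oogle")

totals = Counter()
for title, views in rows:
    # Keep rows whose title matches the pattern anywhere in the string.
    if pattern.search(title):
        # Group by title and sum the views.
        totals[title] += views

# Sort by total views, descending, and keep at most 100 titles.
top = totals.most_common(100)
print(top)
```

The BigQuery DataFrames version expresses the same steps, but the work runs in BigQuery rather than in a local loop.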


In addition to giving you access to pandas, BigQuery DataFrames also enables you to build ML models, run inference, and deploy and run your own Python functions at scale. You'll see examples throughout this and other notebooks in this GitHub repo.

Now you'll move to the smaller `penguins` dataset for the remainder of this getting started guide.

## Create a BigQuery DataFrames DataFrame

You can create a BigQuery DataFrames DataFrame by reading data from any of the following locations:

* A local data file
* Data stored in a BigQuery table
* A data file stored in Cloud Storage
* An in-memory pandas DataFrame

The following sections show how to use the first two options.
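
For the other two sources, a rough sketch might look like the following. It assumes that `read_csv` accepts a `gs://` path and that `read_pandas` accepts an in-memory pandas DataFrame; the helper names are made up for illustration, and `bpd` stands for the imported `bigframes.pandas` module (`bf` in this notebook).

```python
def public_url_to_gs_uri(url: str) -> str:
    """Convert a storage.googleapis.com object URL to a gs:// URI."""
    prefix = "storage.googleapis.com/"
    _, sep, path = url.partition(prefix)
    if not sep:
        raise ValueError(f"not a storage.googleapis.com URL: {url}")
    return "gs://" + path


def read_from_cloud_storage(bpd, gs_uri: str):
    """Read a CSV stored in Cloud Storage; `bpd` is `bigframes.pandas`."""
    return bpd.read_csv(gs_uri)


def read_from_pandas(bpd, pandas_df):
    """Load an in-memory pandas DataFrame; `bpd` is `bigframes.pandas`."""
    return bpd.read_pandas(pandas_df)


# Example (not executed here):
# uri = public_url_to_gs_uri(
#     "http://storage.googleapis.com/cloud-samples-data/vertex-ai/bigframe/penguins.csv"
# )
# df_from_gcs = read_from_cloud_storage(bf, uri)
```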

### Create a DataFrame from a local file

Use the instructions in the following sections to create a BigQuery DataFrames DataFrame from a local file.


#### Get the CSV file

First, copy and paste the following link into a new browser window to download the CSV file of the penguin data to your local machine:

> http://storage.googleapis.com/cloud-samples-data/vertex-ai/bigframe/penguins.csv

Next, upload the local CSV file to your notebook environment, using the relevant instructions for your environment:

**Vertex AI Workbench or a local JupyterLab instance**

1. Follow these [directions](https://jupyterlab.readthedocs.io/en/latest/user/files.html#uploading-and-downloading) to upload the file from your machine to your notebook environment by using the UI.
2. Set the variable `fn` to match the path to your file, and then run the cell.

In [10]:
fn = 'penguins.csv'

**Colab**

Uncomment and run the following cell:

In [11]:
# from google.colab import files
# uploaded = files.upload()
# for fn in uploaded.keys():
#  print('User uploaded file "{name}" with length {length} bytes'.format(
#      name=fn, length=len(uploaded[fn])))
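
If your notebook environment has network access, you could skip the manual download and upload and fetch the file directly instead. This is a sketch using only the standard library; the helper names are illustrative, and it writes the file into the current working directory.

```python
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve

CSV_URL = "http://storage.googleapis.com/cloud-samples-data/vertex-ai/bigframe/penguins.csv"


def local_filename(url: str) -> str:
    """Return the last path component of the URL."""
    return posixpath.basename(urlparse(url).path)


def download(url: str = CSV_URL) -> str:
    """Download the file into the working directory and return its name."""
    filename = local_filename(url)
    urlretrieve(url, filename)
    return filename


# Example (needs network access): fn = download()
```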

#### Create a DataFrame

Create a BigQuery DataFrames DataFrame from the uploaded CSV file:

In [12]:
df_from_local = bf.read_csv(fn)

Take a look at the first few rows of the DataFrame:

In [13]:
df_from_local.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Adelie Penguin (Pygoscelis adeliae),Dream,39.8,19.1,184.0,4650.0,MALE
2,Adelie Penguin (Pygoscelis adeliae),Dream,40.9,18.9,184.0,3900.0,MALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,46.5,17.9,192.0,3500.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Dream,37.3,16.8,192.0,3000.0,FEMALE


### Ingest data from a DataFrame to a BigQuery table

BigQuery DataFrames lets you create a BigQuery table from a BigQuery DataFrames DataFrame on the fly.

First, create a BigQuery dataset to house the table. Choose a name for your dataset, or keep the suggestion of `birds`.

In [14]:
DATASET_ID = "birds"  # @param {type:"string"}

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(PROJECT_ID + "." + DATASET_ID)
dataset.location = REGION
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.dataset_id} created.")

Dataset birds created.


Next, use the `to_gbq` method to create a BigQuery table from the DataFrame:

In [15]:
df_from_local.to_gbq(PROJECT_ID + "." + DATASET_ID + ".penguins")

'bqstackdemo.birds.penguins'
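
When you rerun the notebook, the destination table already exists, so it can help to state explicitly what should happen in that case. The sketch below assumes that the installed version of `to_gbq` accepts a pandas-style `if_exists` argument (`"fail"`, `"replace"`, or `"append"`); both helper names are hypothetical.

```python
def table_id(project: str, dataset: str, table: str) -> str:
    """Build a fully qualified table ID such as 'project.dataset.table'."""
    return f"{project}.{dataset}.{table}"


def write_table(df, destination: str, if_exists: str = "replace") -> str:
    """Write `df` to `destination`, overwriting an existing table by default.

    Assumes `df.to_gbq` supports the `if_exists` keyword argument.
    """
    return df.to_gbq(destination, if_exists=if_exists)


# Example (not executed here):
# write_table(df_from_local, table_id(PROJECT_ID, DATASET_ID, "penguins"))
```

Choosing `"replace"` keeps reruns repeatable for a tutorial table. For data you care about, `"fail"` or `"append"` may be the safer choice.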

### Create a DataFrame from BigQuery data

You can create a BigQuery DataFrames DataFrame from a BigQuery table by using the `read_gbq` method and referencing either an entire table or a SQL query.

Create a BigQuery DataFrames DataFrame from the BigQuery table you created in the previous section, and view a few rows:

In [16]:
query_or_table = f"""{PROJECT_ID}.{DATASET_ID}.penguins"""
bq_df = bf.read_gbq(query_or_table)
bq_df.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Gentoo penguin (Pygoscelis papua),Biscoe,50.5,15.9,225.0,5400.0,MALE
1,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,215.0,5000.0,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,41.4,18.5,202.0,3875.0,MALE
3,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.6,17.0,188.0,2900.0,FEMALE
4,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,14.8,217.0,5200.0,FEMALE
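
The cell above passes a table ID. To illustrate the SQL query form of `read_gbq`, here is a sketch that builds a query string and hands it to the same method. The helper names and the 5000 g threshold are assumptions chosen for illustration; `bpd` stands for the imported `bigframes.pandas` module (`bf` in this notebook).

```python
def heavy_penguins_sql(table: str, min_body_mass_g: float) -> str:
    """Build a query selecting penguins at or above a body mass threshold."""
    return (
        "SELECT species, island, body_mass_g "
        f"FROM `{table}` "
        # Casting to float keeps arbitrary text out of the query.
        f"WHERE body_mass_g >= {float(min_body_mass_g)}"
    )


def read_query(bpd, sql: str):
    """Run the query and return the result as a BigQuery DataFrames DataFrame."""
    return bpd.read_gbq(sql)


# Example (not executed here):
# sql = heavy_penguins_sql(f"{PROJECT_ID}.{DATASET_ID}.penguins", 5000)
# heavy_df = read_query(bf, sql)
```

The table name is interpolated directly into the query, so only pass table IDs that you control.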


## Inspect and manipulate data in BigQuery DataFrames

### Using pandas

You can use pandas as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment. There are 150+ pandas functions supported in BigQuery DataFrames. You can view the list in [the documentation](https://cloud.google.com/python/docs/reference/bigframes/latest).

To see this in action, inspect one of the columns (or series) of the BigQuery DataFrames DataFrame:

In [17]:
bq_df["body_mass_g"].head(10)

0    5400.0
1    5000.0
2    3875.0
3    2900.0
4    5200.0
5    3725.0
6    2975.0
7    4150.0
8    5300.0
9    4150.0
Name: body_mass_g, dtype: Float64

Compute the mean of this series:

In [20]:
average_body_mass = bq_df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

average_body_mass: 4201.75438596491
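
One detail worth knowing when you read this number: in pandas, `mean()` skips missing values by default, and this sketch assumes BigQuery DataFrames follows the same convention. The plain Python below shows that behavior on made-up values, where `None` stands in for a missing body mass.

```python
from statistics import fmean

# Made-up values; None stands in for a missing body mass.
body_mass_g = [3475.0, None, 4650.0, 3900.0]

# Drop missing values before averaging, mirroring pandas' default skipna=True.
present = [v for v in body_mass_g if v is not None]
average = fmean(present)
print(average)
```

The average is taken over the three present values only, so rows with a missing body mass do not pull the result toward zero.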


You can confirm that the calculations were run in BigQuery by clicking "Open job" from the previous cells' output. This takes you to the BigQuery console to view the SQL statement and job details.

## Summary and next steps

You've created BigQuery DataFrames DataFrames, and inspected and manipulated data with pandas and custom remote functions at BigQuery scale and speed.

Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks), including an introductory notebook for `bigframes.ml`.

### Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

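Otherwise, uncomment and run the remaining cells to delete the individual resources you created in this tutorial. The connection, Cloud Function, and remote function cells apply only if you deployed a remote function, and the last two expect `CLOUD_FUNCTION_NAME` and `REMOTE_FUNCTION_NAME` to be defined: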

In [None]:
# # Delete the BigQuery dataset
# from google.cloud import bigquery
# client = bigquery.Client(project=PROJECT_ID)
# client.delete_dataset(
#  DATASET_ID, delete_contents=True, not_found_ok=True
# )
# print("Deleted dataset '{}'.".format(DATASET_ID))

In [None]:
# # Delete the BigQuery Connection
# from google.cloud import bigquery_connection_v1 as bq_connection
# client = bq_connection.ConnectionServiceClient()
# CONNECTION_ID = f"projects/{PROJECT_ID}/locations/{REGION}/connections/bigframes-rf-conn"
# client.delete_connection(name=CONNECTION_ID)
# print("Deleted connection '{}'.".format(CONNECTION_ID))

In [None]:
# # Delete the Cloud Function
# ! gcloud functions delete {CLOUD_FUNCTION_NAME} --quiet

In [None]:
# # Delete the Remote Function
# REMOTE_FUNCTION_NAME = REMOTE_FUNCTION_NAME.replace(PROJECT_ID + ".", "")
# ! bq rm --routine --force=true {REMOTE_FUNCTION_NAME}