# Using ML - ML fundamentals

The `bigframes.ml` module implements Scikit-Learn's machine learning API in
BigQuery DataFrames. It exposes BigQuery's ML capabilities in a simple, popular
API that works seamlessly with the rest of the BigQuery DataFrames API.

This notebook is adapted from the following doc: [ML Fundamental with bigframes.ml](https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/ml_fundamentals.ipynb)

In [1]:
# Let's load some test data to use in this tutorial
import bigframes.pandas

df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()

# Temporary workaround: let's name our index so it isn't lost. BigQuery DataFrames
# currently drops unnamed indexes when round-tripping through pandas, which
# some ML APIs do to route around missing functionality
df.index.name = "penguin_id"

df

penguin_id,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Gentoo penguin (Pygoscelis papua),Biscoe,50.5,15.9,225.0,5400.0,MALE
1,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,215.0,5000.0,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,41.4,18.5,202.0,3875.0,MALE
3,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.6,17.0,188.0,2900.0,FEMALE
4,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,14.8,217.0,5200.0,FEMALE
5,Adelie Penguin (Pygoscelis adeliae),Biscoe,35.0,17.9,192.0,3725.0,FEMALE
7,Gentoo penguin (Pygoscelis papua),Biscoe,42.0,13.5,210.0,4150.0,FEMALE
8,Gentoo penguin (Pygoscelis papua),Biscoe,48.5,14.1,220.0,5300.0,MALE
9,Adelie Penguin (Pygoscelis adeliae),Torgersen,45.8,18.9,197.0,4150.0,MALE
10,Chinstrap penguin (Pygoscelis antarctica),Dream,49.0,19.6,212.0,4300.0,MALE


## Data split

Part of preparing data for a machine learning task is splitting it into subsets for training and testing, so that the model can be evaluated on rows it has not seen and overfitting can be detected. Most commonly this is done with `bigframes.ml.model_selection.train_test_split`, like so:

In [2]:
# In this example, we're doing supervised learning, where we will learn to predict
# output variable `y` from input features `X`
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']] 

from bigframes.ml.model_selection import train_test_split

# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2)

# Show the shape of the data after the split
print(f"""X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}""")

X_train shape: (267, 6)
X_test shape: (67, 6)
y_train shape: (267, 1)
y_test shape: (67, 1)


In [3]:
# If we look at the data, we can see that random rows were selected for
# each side of the split
X_test.head(5)

penguin_id,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,species
312,Biscoe,48.4,16.3,220.0,MALE,Gentoo penguin (Pygoscelis papua)
142,Biscoe,46.1,13.2,211.0,FEMALE,Gentoo penguin (Pygoscelis papua)
317,Biscoe,40.1,18.9,188.0,MALE,Adelie Penguin (Pygoscelis adeliae)
343,Dream,45.2,16.6,191.0,FEMALE,Chinstrap penguin (Pygoscelis antarctica)
326,Biscoe,44.4,17.3,219.0,MALE,Gentoo penguin (Pygoscelis papua)


In [4]:
# Note that this matches the rows in X_test
y_test.head(5)

penguin_id,body_mass_g
312,5400.0
142,4500.0
317,4300.0
343,3250.0
326,5250.0
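
The row alignment between `X_test` and `y_test` can be illustrated with a small plain-Python sketch. This is not how `bigframes` implements the split; the helper `split_row_ids` and the toy data are hypothetical, and the only idea shown is that both sides are selected by the same randomly chosen row ids.

```python
import random

def split_row_ids(row_ids, test_size=0.2, seed=0):
    """Randomly partition row ids into (train_ids, test_ids). Hypothetical helper."""
    shuffled = list(row_ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-ins for X and y, keyed by the same row ids
X = {i: {"flipper_length_mm": 180.0 + i} for i in range(10)}
y = {i: 3000.0 + 50.0 * i for i in range(10)}

train_ids, test_ids = split_row_ids(X.keys(), test_size=0.2)

# Selecting X and y by the same ids keeps features and labels aligned
X_test = {i: X[i] for i in test_ids}
y_test = {i: y[i] for i in test_ids}

print(len(train_ids), len(test_ids))
```

Because the ids are shuffled before slicing, every row lands in exactly one of the two subsets.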


## Estimators

Following Scikit-Learn, all learning components are "estimators": objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:

- a constructor that takes a list of parameters
- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`
- a `.fit(...)` method to fit the estimator to training data
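
These three patterns can be sketched with a toy class. `MeanPredictor` and its `offset` parameter are hypothetical and exist only for illustration; this is not a `bigframes.ml` class, and the real estimators may build their string representation differently.

```python
import inspect

class MeanPredictor:
    """Hypothetical toy estimator that learns the mean of y."""

    def __init__(self, offset=0.0):
        # Constructor parameters are stored under the same names
        self.offset = offset

    def __repr__(self):
        # Show the class name and only the parameters that differ from their defaults
        params = inspect.signature(type(self).__init__).parameters
        shown = [
            f"{name}={getattr(self, name)!r}"
            for name, param in params.items()
            if name != "self" and getattr(self, name) != param.default
        ]
        return f"{type(self).__name__}({', '.join(shown)})"

    def fit(self, X, y):
        # Learned state gets a trailing underscore, following the Scikit-Learn convention
        self.mean_ = sum(y) / len(y) + self.offset
        return self

default_repr = repr(MeanPredictor())
custom_repr = repr(MeanPredictor(offset=1.5))
fitted = MeanPredictor().fit([[1.0], [2.0]], [10.0, 20.0])
print(default_repr, custom_repr, fitted.mean_)
```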

These estimators can be further broken down into two main subtypes:

### Transformers

Transformers are estimators that are used to prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, the transformer implements a `.transform(...)` method, which will apply a transformation based on what was computed during `.fit(...)`. With this pattern, dynamic preprocessing steps can be applied consistently to both training and test/production data.

An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:

In [5]:
from bigframes.ml.preprocessing import StandardScaler

# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

scaler = StandardScaler()
scaler.fit(X_train[numeric_columns])

# Now, StandardScaler should transform the numbers to have a mean of zero
# and a standard deviation of one:
scaler.transform(X_train[numeric_columns])

penguin_id,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm
0,1.131292,-0.643554,1.68983
1,0.167457,-1.353151,0.986496
4,0.41734,-1.201095,1.127163
5,-1.635271,0.370156,-0.631172
7,-0.385855,-1.860007,0.634829
8,0.774316,-1.555893,1.338163
9,0.292399,0.877012,-0.279505
10,0.86356,1.23181,0.775496
11,-1.474632,-0.288755,-0.771839
12,1.238385,-0.440812,1.338163


In [6]:
# We can then repeat this transformation on new data
scaler.transform(X_test[numeric_columns])

penguin_id,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm
2,-0.492948,0.674269,0.072162
3,-0.992714,-0.086013,-0.912505
14,0.488736,-0.288755,-0.631172
15,-0.546494,0.167414,-0.771839
20,-0.243065,-1.505208,0.564496
27,0.27455,-0.086013,-0.420172
34,-0.546494,0.471527,0.283162
67,0.613677,-1.353151,0.986496
68,-0.600041,0.623584,-0.420172
75,-1.135504,1.434552,-0.771839
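
The fit-then-transform pattern can be sketched in plain Python. `ToyStandardScaler` is a hypothetical stand-in, not the library's implementation, and it assumes the population standard deviation; the statistic used by `bigframes.ml.preprocessing.StandardScaler` may differ.

```python
from statistics import fmean, pstdev

class ToyStandardScaler:
    """Hypothetical single-column scaler illustrating fit/transform."""

    def fit(self, values):
        # Statistics are computed once, from the training data only
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)  # assumption: population standard deviation
        return self

    def transform(self, values):
        # The stored training statistics are reused for any data passed in
        return [(v - self.mean_) / self.std_ for v in values]

train = [180.0, 190.0, 200.0, 210.0, 220.0]
test = [200.0, 230.0]

scaler = ToyStandardScaler().fit(train)
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)
print(scaled_train, scaled_test)
```

Only the training data is guaranteed to end up with a mean of zero and a standard deviation of one. The test data is shifted and scaled by the training statistics, so its own mean and deviation will generally differ.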


#### Composing transformers

To process data where different columns need different preprocessors, `bigframes.ml.compose.ColumnTransformer` can be employed:

In [7]:
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder

# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species", "sex", "island"])])

# Now we can fit all columns of the training data
preproc.fit(X_train)

processed_X_train = preproc.transform(X_train)
processed_X_test = preproc.transform(X_test)

processed_X_train

penguin_id,onehotencoded_island,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm,onehotencoded_sex,onehotencoded_species
0,"[{'index': 1, 'value': 1.0}]",1.131292,-0.643554,1.68983,"[{'index': 3, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
1,"[{'index': 1, 'value': 1.0}]",0.167457,-1.353151,0.986496,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
4,"[{'index': 1, 'value': 1.0}]",0.41734,-1.201095,1.127163,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
5,"[{'index': 1, 'value': 1.0}]",-1.635271,0.370156,-0.631172,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
7,"[{'index': 1, 'value': 1.0}]",-0.385855,-1.860007,0.634829,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
8,"[{'index': 1, 'value': 1.0}]",0.774316,-1.555893,1.338163,"[{'index': 3, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
9,"[{'index': 3, 'value': 1.0}]",0.292399,0.877012,-0.279505,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
10,"[{'index': 2, 'value': 1.0}]",0.86356,1.23181,0.775496,"[{'index': 3, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
11,"[{'index': 3, 'value': 1.0}]",-1.474632,-0.288755,-0.771839,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
12,"[{'index': 1, 'value': 1.0}]",1.238385,-0.440812,1.338163,"[{'index': 3, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
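
The one-hot columns in the output above are shown as sparse index/value pairs rather than as one column per category. The sketch below produces the same shape in plain Python. It assumes categories are numbered from 1 in sorted order, which is consistent with the `onehotencoded_island` values shown but is only an inference from that output; the helper names and the use of index 0 for unseen values are hypothetical.

```python
def fit_categories(values):
    """Map each distinct category to a 1-based index in sorted order (assumption)."""
    return {category: i for i, category in enumerate(sorted(set(values)), start=1)}

def one_hot_sparse(value, categories):
    """Encode one value as a single index/value pair."""
    # Index 0 for unseen values is a choice made for this sketch
    return [{"index": categories.get(value, 0), "value": 1.0}]

islands = ["Biscoe", "Torgersen", "Dream", "Biscoe"]
categories = fit_categories(islands)
encoded = [one_hot_sparse(v, categories) for v in islands]
print(categories)
print(encoded)
```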


### Predictors

Predictors are estimators that learn and make predictions. In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.

Predictors can be further broken down into two categories:

#### Supervised predictors

Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_model.LinearRegression`.

In [8]:
from bigframes.ml.linear_model import LinearRegression

linreg = LinearRegression()

# Learn from the training data how to predict output y
linreg.fit(processed_X_train, y_train)

# Predict y for the test data
predicted_y_test = linreg.predict(processed_X_test)

predicted_y_test

penguin_id,predicted_body_mass_g
2,4091.429391
3,3311.639644
14,3343.000454
15,3916.46808
20,4588.140941
27,3392.166301
34,4224.344845
67,5200.833055
68,4051.930559
75,3966.128585
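
To make the fit/predict contract concrete, here is a minimal single-feature least-squares sketch in plain Python. `ToyLinearRegression` is hypothetical and much simpler than the BigQuery ML model used above, which handles many features and is trained in BigQuery.

```python
class ToyLinearRegression:
    """Hypothetical one-feature ordinary least squares."""

    def fit(self, x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # Closed-form slope and intercept for a single feature
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        self.coef_ = sxy / sxx
        self.intercept_ = mean_y - self.coef_ * mean_x
        return self

    def predict(self, x):
        # Apply what was learned during fit to new inputs
        return [self.intercept_ + self.coef_ * xi for xi in x]

model = ToyLinearRegression().fit([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
predictions = model.predict([4.0])
print(model.coef_, model.intercept_, predictions)
```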


#### Unsupervised predictors

In unsupervised learning, there are no known outputs in the training data; instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters.

In [9]:
from bigframes.ml.cluster import KMeans

kmeans = KMeans(n_clusters=4)

kmeans.fit(processed_X_train)

kmeans.predict(processed_X_test)

penguin_id,CENTROID_ID
2,3
3,3
14,1
15,3
20,4
27,1
34,3
67,4
68,3
75,3
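
The prediction step of a k-means model amounts to assigning each row to its nearest centroid. The sketch below shows only that step, with made-up centroids and a hypothetical helper; how the centroids are learned during `.fit(...)` is not shown, and the 1-based numbering is chosen to resemble the `CENTROID_ID` column above.

```python
import math

def nearest_centroid_id(point, centroids):
    """Return the 1-based id of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances)) + 1

# Made-up centroids standing in for what a fitted model would have learned
centroids = [(0.0, 0.0), (5.0, 5.0)]
points = [(1.0, 0.5), (4.0, 6.0)]

labels = [nearest_centroid_id(p, centroids) for p in points]
print(labels)
```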


## Pipelines

Transformers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:

In [None]:
from bigframes.ml.pipeline import Pipeline

pipeline = Pipeline([
  ('preproc', preproc),
  ('linreg', linreg)
])

# Print our pipeline
pipeline

The pipeline simplifies the workflow by applying each of its component steps automatically:

In [None]:
pipeline.fit(X_train, y_train)

predicted_y_test = pipeline.predict(X_test)
predicted_y_test

In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step.
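
Conceptually, a pipeline fits and applies each transformer in order and hands the result to the final predictor. The plain-Python sketch below shows that chaining only; all three classes are hypothetical toys, and it does not reflect how `bigframes` compiles a pipeline into a single model.

```python
class Center:
    """Toy transformer: subtract the training mean."""
    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self
    def transform(self, X):
        return [x - self.mean_ for x in X]

class SlopeThroughOrigin:
    """Toy predictor: least-squares line through the origin."""
    def fit(self, X, y):
        self.coef_ = sum(x * yi for x, yi in zip(X, y)) / sum(x * x for x in X)
        return self
    def predict(self, X):
        return [self.coef_ * x for x in X]

class ToyPipeline:
    """Chains (name, transformer) steps followed by one (name, predictor) step."""
    def __init__(self, steps):
        self.transformers = [step for _, step in steps[:-1]]
        self.predictor = steps[-1][1]
    def fit(self, X, y):
        for transformer in self.transformers:
            # Each transformer is fitted on the output of the previous one
            X = transformer.fit(X).transform(X)
        self.predictor.fit(X, y)
        return self
    def predict(self, X):
        for transformer in self.transformers:
            # No refitting here: the training-time statistics are reused
            X = transformer.transform(X)
        return self.predictor.predict(X)

toy_pipeline = ToyPipeline([("center", Center()), ("model", SlopeThroughOrigin())])
toy_pipeline.fit([1.0, 2.0, 3.0], [-2.0, 0.0, 2.0])
toy_predictions = toy_pipeline.predict([4.0])
print(toy_predictions)
```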

## Evaluating results

Some models include a convenient `.score(X, y)` method for evaluation with a preset accuracy metric:

In [None]:
# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(X_test, y_test)

For a more general approach, the `bigframes.ml.metrics` module is provided:

In [None]:
from bigframes.ml.metrics import r2_score

r2_score(y_test, predicted_y_test["predicted_body_mass_g"])
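
For reference, the coefficient of determination is conventionally defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes that standard definition in plain Python on made-up numbers; the function name `toy_r2` is hypothetical, and this is not the code behind `bigframes.ml.metrics.r2_score`.

```python
def toy_r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, using the standard definition."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]

perfect = toy_r2(y_true, [3.0, 5.0, 7.0])     # predictions match exactly
baseline = toy_r2(y_true, [5.0, 5.0, 5.0])    # always predicting the mean
close = toy_r2(y_true, [2.5, 5.0, 7.5])       # small errors at both ends
print(perfect, baseline, close)
```

A score of 1.0 means the predictions match exactly, and a score of 0.0 means the model does no better than always predicting the mean of `y_true`.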

## Save/Load to BigQuery

Estimators can be saved to BigQuery as BQML models, and loaded again in future.

Saving requires `bigquery.tables.create` permission, and loading requires `bigquery.models.getMetadata` permission.
These permissions can be at project level or the dataset level.

If you have those permissions, uncomment the code in the following cells and run it.

In [15]:
# # Replace with a path where you have permission to save a model
# model_name = "bigframes-dev.bqml_tutorial.penguins_model"

# linreg.to_gbq(model_name, replace=True)

In [16]:
# # WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# # and details of their transform steps will be lost (the loaded model will behave the same)
# bigframes.pandas.read_gbq_model(model_name)