# Using Spanner with Machine Learning

Spanner is a horizontally scalable database that is well suited to hosting production data for high-throughput transactional applications.

Users often want to read their production data from Spanner in order to train models or run other ML workloads.  Spanner's Python client libraries make that data available to popular machine-learning tools.

These instructions assume that you have an account and project with Google Cloud.  If not, you can sign up at [cloud.google.com](https://cloud.google.com/).

## Step 1: Install Dependencies

Connecting to Spanner requires installing Spanner's Python Client.

The spanner-analytics package contains additional functions that facilitate ML workflows, including a Jupyter "Magic" command for bulk-fetching data from Spanner.

This example demonstrates how to use Spanner data with scikit-learn models.  This approach can easily be adapted to other popular modeling libraries as well.

In [None]:
!pip install google-cloud-spanner spanner-analytics scikit-learn
%load_ext spanner_analytics.magic

## Step 2:  Authenticate to GCP

Google offers a variety of options for authenticating to GCP.  Please see the [documentation](https://googleapis.dev/python/google-api-core/latest/auth.html) for more details.

Google's hosted notebook offerings, such as Colab, provide a convenient built-in authentication method, as illustrated below.  This method opens a pop-up window asking you to authenticate this notebook using your Google credentials.

In [None]:
from google.colab import auth
auth.authenticate_user()

## Step 3:  Fetching Data

Now we want to read data from Spanner into a DataFrame.  To do this, we use the Spanner Magic command that we loaded previously.

Make sure to substitute in your own project, instance, and database IDs.  Also, update the query to point at an existing table in your database.  If you don't yet have an instance, database, or table, see Cloud Spanner's [Quickstart Guide](https://cloud.google.com/spanner/docs/quickstart-console).

In [None]:
%%spanner results_df --project <project_id> --instance <instance_id> --database <database_id>

SELECT train1, train2, train3, actual FROM my_table;

The query runs using Spanner [Data Boost](https://cloud.google.com/spanner/docs/databoost/databoost-overview), which avoids placing load on your production instance by running the query on separate compute resources.

The query result is placed into a Pandas DataFrame named `results_df` (the first argument to the `%%spanner` magic):

In [None]:
results_df
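If the `%%spanner` magic is not available in your environment, the same data can be read through the Python client installed in Step 1.  The following is a minimal sketch, not the method the magic uses internally.  The helper names `rows_to_dataframe` and `fetch_with_client` are hypothetical, and the column names are passed in explicitly rather than read from the result set.

```python
import pandas as pd


def rows_to_dataframe(rows, columns):
    """Build a DataFrame from an iterable of row sequences and column names."""
    return pd.DataFrame([list(row) for row in rows], columns=columns)


def fetch_with_client(project_id, instance_id, database_id, sql, columns):
    """Run `sql` in a read-only snapshot and return the rows as a DataFrame.

    This uses the regular client read path; this sketch does not request
    Data Boost, so the query runs on the instance's own compute.
    """
    # Imported lazily so rows_to_dataframe stays usable without the client installed.
    from google.cloud import spanner

    client = spanner.Client(project=project_id)
    database = client.instance(instance_id).database(database_id)
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(sql)
        # Consume the streamed rows while the snapshot is still open.
        return rows_to_dataframe(rows, columns)


# Offline demonstration of the conversion step, using made-up rows.
demo_df = rows_to_dataframe(
    [(1.0, 2.0, 3.0, 10.0), (2.0, 1.0, 0.5, 7.0)],
    ["train1", "train2", "train3", "actual"],
)
print(demo_df)
```

With real IDs, a call such as `fetch_with_client(project_id, instance_id, database_id, "SELECT train1, train2, train3, actual FROM my_table", ["train1", "train2", "train3", "actual"])` would return a DataFrame with the same shape as `results_df`.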

## Step 4:  Training a Model

Now that we have the data in a DataFrame, we can use it to train a model.  As an example, let's train a linear regression model.

In [None]:
from sklearn import linear_model

# Fit a linear regression on the three feature columns, with "actual" as the target.
regr = linear_model.LinearRegression()
regr.fit(results_df.loc[:, ["train1", "train2", "train3"]], results_df.loc[:, ["actual"]])
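Real tables often contain NULLs, which typically surface as missing values in the DataFrame, and `LinearRegression` rejects inputs that contain NaN.  The sketch below shows one way to handle this, assuming that dropping incomplete rows is acceptable for your data.  It uses a synthetic stand-in for `results_df` (the data and the names `demo_df`, `clean_df`, and `demo_regr` are made up for illustration) so that the fitted coefficients can be checked against known values.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for results_df:
# actual = 2.0*train1 - 1.0*train2 + 0.5*train3 + 3.0 (no noise)
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
feature_cols = ["train1", "train2", "train3"]
demo_df = pd.DataFrame(features, columns=feature_cols)
demo_df["actual"] = features @ np.array([2.0, -1.0, 0.5]) + 3.0

# Simulate a NULL value coming back from the database.
demo_df.loc[0, "train2"] = np.nan

# Drop any row with a missing feature or target before fitting.
clean_df = demo_df.dropna(subset=feature_cols + ["actual"])

demo_regr = linear_model.LinearRegression()
demo_regr.fit(clean_df[feature_cols], clean_df["actual"])

# Inspect what the model learned.
print(dict(zip(feature_cols, demo_regr.coef_.round(3))))
print(round(float(demo_regr.intercept_), 3))
```

Here the target is passed as a single column (a Series), so `coef_` is a flat array with one value per feature.  Passing a one-column DataFrame as the target also works, but `coef_` then has shape `(1, 3)`.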

## Step 5:  Generating Predictions

Now that we have a model, we can go ahead and use it!  Let's try generating predictions using our training data.  Predicting on the data the model was trained on doesn't validate the model's quality; for that, we would need to hold out a test set, for example with scikit-learn's [train/test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).  It is, however, a simple way to demonstrate that the model has been trained.

In [None]:
regr.predict(results_df.loc[:, ["train1","train2","train3"]])
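To evaluate the model on data it has not seen, a held-out split could look like the sketch below.  It again uses synthetic data in place of `results_df` (the data, the 25% test size, and the names `split_df` and `split_regr` are illustrative choices), so it runs without a Spanner connection.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for results_df: a linear signal plus a little noise.
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.1, size=200)
feature_cols = ["train1", "train2", "train3"]
split_df = pd.DataFrame(features, columns=feature_cols)
split_df["actual"] = features @ np.array([1.5, -2.0, 0.7]) + noise

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    split_df[feature_cols], split_df["actual"], test_size=0.25, random_state=0
)

split_regr = linear_model.LinearRegression()
split_regr.fit(X_train, y_train)

# Score predictions on the held-out rows only.
test_r2 = r2_score(y_test, split_regr.predict(X_test))
print(f"held-out R^2: {test_r2:.3f}")
```

To apply this to your own data, replace `split_df` with `results_df`.  A fixed `random_state` keeps the split reproducible between runs.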