# Chapter 5 - Part 1: Model Training

Welcome to chapter 5 of our Snowflake Data Scientist training series.

In chapter 5 we will look at model deployment options. The code is structured into three parts:
- Part 1: We train a model and save its binary file to disk, called the "pickle file".
- Part 2: We deploy the pickle file to an Azure Function and call it via API Integration from Snowflake.
- Part 3: We deploy the pickle file directly to Snowflake using Snowpark.

Happy coding!


### 1.) Connecting to Snowflake

To connect to your Snowflake instance, make sure you have all requirements installed and your connection details ready.

In [2]:
%load_ext sql
%config SqlMagic.autocommit=False # for engines that do not support autocommit

In [5]:
##
## Make sure DATABASE_URL is set and exported in your environment. Otherwise, run the following magic command:
##   Snowflake driver accepts the following parameters
##   URL = 'snowflake://<user_login_name>:<password>@<account_identifier>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>'
##   Example:
##   %sql snowflake://user:password@xxxyyyzzz.west-europe.azure/DEMO/UDEMY?warehouse=PUBLIC
##

In [None]:
%sql SELECT 1 as "Connected"
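
The connection URL described in the comment above can also be assembled in Python before running any `%sql` magic. The following is a minimal sketch, not part of the original notebook: the helper `build_snowflake_url` and the environment variable names `SNOWFLAKE_USER` and `SNOWFLAKE_PASSWORD` are hypothetical, and the account, database, schema, and warehouse values are simply the ones from the example above.

```python
import os
from urllib.parse import quote_plus


def build_snowflake_url(user, password, account, database, schema, warehouse, role=None):
    """Assemble a connection URL in the format shown in the comment above."""
    # Percent-encode the credentials so characters such as '@' or '/'
    # cannot be mistaken for URL separators.
    url = (
        f"snowflake://{quote_plus(user)}:{quote_plus(password)}"
        f"@{account}/{database}/{schema}?warehouse={warehouse}"
    )
    if role:
        url += f"&role={role}"
    return url


# Hypothetical environment variable names; adapt them to your own setup.
example_url = build_snowflake_url(
    user=os.environ.get("SNOWFLAKE_USER", "user"),
    password=os.environ.get("SNOWFLAKE_PASSWORD", "password"),
    account="xxxyyyzzz.west-europe.azure",
    database="DEMO",
    schema="UDEMY",
    warehouse="PUBLIC",
)

# Export the URL under the name the comment above refers to.
os.environ["DATABASE_URL"] = example_url
```

Reading the credentials from the environment keeps the password out of the notebook itself, which matters once the notebook is shared or committed.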

### 2.) Getting & transforming data from Snowflake

First we get the data from Snowflake. Typically you'd want to do this 
as part of a CI/CD pipeline to regularly train your model. 

We take the following steps to get our data ready:
- Step 1: Only select the data we want the machine learning model to see, e.g. no patient ID.
- Step 2: To allow for label encoding (converting strings to numbers), we concatenate the features sex and agegroup into a single string.
- Step 3: We then select the encoded sex & agegroup feature alongside blood pressure before and after the intervention.
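
For readers who think in pandas rather than SQL, the three steps above can be sketched as follows. This is only an illustration of the idea, not the notebook's method: the sample rows are made up, and `pd.factorize` stands in for the array-based encoding done in Snowflake.

```python
import pandas as pd

# Hypothetical sample rows; the real values live in the blood_pressure table.
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "agegrp": ["30-45", "30-45", "46-59", "30-45"],
    "bp_before": [143, 150, 153, 138],
    "bp_after": [153, 142, 170, 135],
})

# Step 2: concatenate the two categorical columns into one string feature.
sample["sex_agegrp"] = sample["sex"] + " " + sample["agegrp"]

# Label encoding: map each distinct string to an integer,
# numbered in order of first appearance.
codes, categories = pd.factorize(sample["sex_agegrp"])
sample["sex_agrgrp_position"] = codes

# Step 3: keep only the encoded feature and the two measurements.
encoded = sample[["sex_agrgrp_position", "bp_before", "bp_after"]]
print(encoded)
```

The integer assigned to each category depends on the order in which the distinct values are collected, so the codes produced here need not match the positions Snowflake returns. Whatever encoding is used for training has to be reproduced exactly at prediction time.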


In [None]:
%%sql result_set <<

---- build out all features: step 1
with feature_table as (
    SELECT 
        sex || ' ' || agegrp as sex_agegrp, 
        bp_before, 
        bp_after
    FROM blood_pressure 
), 

---- label encoding Snowflake style: step 2
distinct_values_table as (
    SELECT 
        array_agg(distinct sex_agegrp) as sex_agrgrp_array 
    FROM feature_table
)

---- and return it: step 3
SELECT
    array_position(sex_agegrp::variant, sex_agrgrp_array) as sex_agrgrp_position,
    bp_before,
    bp_after
FROM feature_table, distinct_values_table


In [9]:
## A result set object is returned. We now convert it into a Pandas dataframe for later use.
df = result_set.DataFrame()

### Step 3: Prepare the data
As part of the model training, we split our dataframe into features (`x`) and labels (`y`). The labels are the data fields we want to predict, here `bp_after`. We then split the dataset into train and test sets so that the model can be evaluated on data it has not seen during training. Later on you'd want to do this split upstream in Snowflake.

In [12]:
from sklearn.model_selection import train_test_split

x = df[['sex_agrgrp_position', 'bp_before']]
y = df['bp_after']

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10)
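
Because `train_test_split` shuffles the rows randomly, every run of the cell above produces a different split. A minimal sketch of a repeatable split is shown below; it assumes a synthetic stand-in for `df` (the `demo` dataframe and its values are made up) and uses the `random_state` parameter, which the original call does not set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataframe returned from Snowflake.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sex_agrgrp_position": rng.integers(0, 4, size=120),
    "bp_before": rng.integers(130, 180, size=120),
    "bp_after": rng.integers(125, 185, size=120),
})
x_demo = demo[["sex_agrgrp_position", "bp_before"]]
y_demo = demo["bp_after"]

# A fixed random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(x_demo, y_demo, test_size=0.10, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.10, random_state=42)

xtrain_demo, xtest_demo, ytrain_demo, ytest_demo = split_a
print(len(xtrain_demo), len(xtest_demo))
```

With 120 rows and `test_size=0.10`, 12 rows go to the test set and 108 to the training set, and both calls select the same rows.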

### Step 4: Train the model

We select a RandomForestRegressor as our ML model for simplicity.

In [13]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(xtrain, ytrain)

In [14]:
score = rfr.score(xtrain, ytrain)
print("R-squared:", score) 

R-squared: 0.7641918988124163


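The model reaches an R-squared of about 0.76, which means it explains roughly 76% of the variation in the response variable (`bp_after`) around its mean. Note that this score is computed on the training data (`xtrain`, `ytrain`), so it is an optimistic estimate; scoring on the held-out test set (`xtest`, `ytest`) gives a more honest picture of how the model generalizes. Not too bad for a first try ;-)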
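
To see how a training-set score and a held-out score can differ, here is a small self-contained sketch. It does not use the blood pressure data: the features and target are synthetic, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: an encoded group, a "before" reading, and a noisy "after" reading.
rng = np.random.default_rng(0)
n_rows = 200
group = rng.integers(0, 4, size=n_rows)
before = rng.normal(155, 10, size=n_rows)
after = 0.8 * before + 2 * group + rng.normal(0, 8, size=n_rows)
features = np.column_stack([group, before])

f_train, f_test, t_train, t_test = train_test_split(
    features, after, test_size=0.10, random_state=0
)

model = RandomForestRegressor(random_state=0)
model.fit(f_train, t_train)

# score() returns R-squared; compare the data the model has seen with data it has not.
train_r2 = model.score(f_train, t_train)
test_r2 = model.score(f_test, t_test)
print("R-squared (train):", train_r2)
print("R-squared (test): ", test_r2)
```

In the notebook itself, the equivalent check would be `rfr.score(xtest, ytest)`.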

### Step 5: Save the model to a pickle file
Now that we are done training the model, we want to save it so Snowflake can use it in the next steps.

In [138]:
import joblib

filename = 'randomforest_classifier.joblib.pkl'
_ = joblib.dump(rfr, filename, compress=9)
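
Before handing the file to the next parts, it can be worth checking that it loads again and yields the same predictions. The sketch below assumes a small throwaway model and a temporary file name (`demo_model.joblib.pkl` is hypothetical) rather than the notebook's `rfr` and its file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small throwaway model trained on synthetic data.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(50, 2))
targets = inputs[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(inputs, targets)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib.pkl")
    # Same call shape as above: dump with maximum compression.
    joblib.dump(model, path, compress=9)
    # joblib.load detects the compression and restores the estimator.
    restored = joblib.load(path)

# The restored model should predict exactly what the original predicts.
same_predictions = np.allclose(model.predict(inputs), restored.predict(inputs))
print(same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library versions they were saved with, so it helps to record the scikit-learn version alongside the file.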