# Chapter 3 - Part 2: Sampling, Binning and Linear Regression

Welcome to chapter 3 of our Snowflake Data Scientist training series.

In chapter 3 we will look at feature engineering options. The code is structured into three parts:
- Part 1: We transform our data into a usable format
- Part 2: We look at sampling and bootstrapping to get a well-sized dataset

### 1.) Connecting to Snowflake

To connect to your Snowflake instance, make sure you have all requirements installed and your connection details ready.

In [47]:
%load_ext sql
%config SqlMagic.autocommit=False # for engines that do not support autocommit

In [48]:
##
## Make sure you have DATABASE_URL set & exported in your environment. Else run the following magic command:
##   Snowflake driver accepts the following parameters
##   URL = 'snowflake://<user_login_name>:<password>@<account_identifier>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>'
##   Example:
##   %sql snowflake://user:password@xxxyyyzzz.west-europe.azure/DEMO/UDEMY?warehouse=PUBLIC
##

In [None]:
%sql SELECT 1 as "Connected"

### 2.) Bootstrapping data in Snowflake

When statisticians sample data, they sometimes put each drawn item back before drawing the next one. This is called random sampling with replacement, as opposed to the more common random sampling without replacement.

Resampling a dataset by **random sampling with replacement is known as bootstrapping**.

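James Weakley posted a brilliant function in the Snowflake forum, see here for details: https://community.snowflake.com/s/article/Bootstrapping-at-Scale-in-Snowflake.

The table function below returns a fixed number of probability rows. Cross joining the table with them yields several candidate copies of every row, and the random filter in the query keeps each copy independently, so a row can appear zero, one or several times in the result.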

In [None]:
%%sql
create or replace function SAMPLE_PROBABILITY(SAMPLE_FRACTION DOUBLE, PROBABILITY_THRESHOLD DOUBLE)
    returns table (SAMPLE_PROBABILITY FLOAT)
    language javascript
    as '{
          processRow: function get_params(row, rowWriter, context){
             var iterations = Math.ceil(Math.log(row.PROBABILITY_THRESHOLD)/Math.log(1-row.SAMPLE_FRACTION));
             for (var i=0; i < iterations; i++){
               rowWriter.writeRow({SAMPLE_PROBABILITY: row.SAMPLE_FRACTION/iterations});
             }
             
          }
        }';


In [None]:
%%sql 

with bootstrapped_table as (
    SELECT *
    FROM blood_pressure, table(SAMPLE_PROBABILITY(0.6::double, 0.00001::double))
    WHERE uniform(0::float, 1::float, random()) < SAMPLE_PROBABILITY
)

select * from bootstrapped_table
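
To see what the function and query above do without touching Snowflake, here is a minimal local sketch in plain Python. It mirrors the arithmetic of SAMPLE_PROBABILITY and the random filter. The helper names `sample_probability` and `bootstrap` and the list of integers standing in for table rows are assumptions made for illustration only.

```python
import math
import random

def sample_probability(sample_fraction, probability_threshold):
    """Mirror the UDTF: return one keep probability per emitted copy."""
    iterations = math.ceil(
        math.log(probability_threshold) / math.log(1 - sample_fraction)
    )
    return [sample_fraction / iterations] * iterations

def bootstrap(rows, sample_fraction, probability_threshold, rng):
    """Cross join every row with the probabilities, then keep copies at random."""
    probabilities = sample_probability(sample_fraction, probability_threshold)
    return [row for row in rows for p in probabilities if rng.random() < p]

probabilities = sample_probability(0.6, 0.00001)
print(len(probabilities))  # 13 candidate copies per row

rng = random.Random(42)    # fixed seed so the sketch is repeatable
rows = list(range(1000))   # hypothetical stand-in for the table rows
sample = bootstrap(rows, 0.6, 0.00001, rng)
print(len(sample), len(set(sample)))  # total rows kept vs. distinct rows kept
```

The second print should show fewer distinct rows than total rows, because some rows are drawn more than once. That is the "with replacement" behaviour the query is after.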

### 3.) Sampling datasets in Snowflake

The SAMPLE clause returns a subset of rows sampled randomly from the specified table. The following sampling methods are supported:

- Sample a fraction of a table, with a specified probability for including a given row. The number of rows returned depends on the size of the table and the requested probability. A seed can be specified to make the sampling deterministic.

- Sample a fixed, specified number of rows. The exact number of specified rows is returned unless the table contains fewer rows.

SAMPLE and TABLESAMPLE are synonymous and can be used interchangeably. See also https://docs.snowflake.com/en/sql-reference/constructs/sample.html

##### Fraction-based sample
Return a sample of a table in which each row has a 10% probability of being included in the sample (a bare number is read as a percentage, not as a row count):

In [None]:
%sql select * from blood_pressure sample (10);

##### Fraction of a table with an explicit method
Return a sample of a table in which each row has a 20.3% probability of being included in the sample, naming the row-level BERNOULLI method explicitly:

In [None]:
%sql select * from blood_pressure tablesample bernoulli (20.3);
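
The difference between the two sampling modes described above can be sketched locally in Python. This is only an analogy and not how Snowflake implements sampling. The helper names `sample_fraction` and `sample_rows` and the integer rows are hypothetical.

```python
import random

def sample_fraction(rows, percent, seed=None):
    """Fraction-based: keep each row independently with probability percent/100."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < percent / 100]

def sample_rows(rows, n):
    """Fixed-size: return exactly n rows, or every row if fewer than n exist."""
    return random.sample(rows, min(n, len(rows)))

rows = list(range(1000))  # hypothetical stand-in for the table rows

first = sample_fraction(rows, 10, seed=1)
second = sample_fraction(rows, 10, seed=1)
print(len(first))              # close to 100, but not guaranteed to be exactly 100
print(first == second)         # True: the same seed gives the same sample

print(len(sample_rows(rows, 10)))      # 10
print(len(sample_rows(rows[:5], 10)))  # 5, the table has fewer rows than requested
```

In Snowflake SQL the fixed-size form is written with the ROWS keyword, for example `sample (10 rows)`.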

### 4.) Binning in Snowflake

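Binning continuous data into intervals can be helpful in a data science project, for example to turn a numeric measurement into a categorical feature. WIDTH_BUCKET assigns each value to one of a given number of equal-width buckets between a minimum and a maximum.

See also https://community.snowflake.com/s/article/Feature-Engineering-in-Snowflake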

In [None]:
%%sql

with aggregates as (
  select 
          min(bp_before) as min_bp_before,
          max(bp_before) as max_bp_before,
          min(bp_after) as min_bp_after,
          max(bp_after) as max_bp_after
  from blood_pressure
)

select 
    patient,
    width_bucket(bp_before, min_bp_before, max_bp_before, 5) as bp_before_bucket,
    width_bucket(bp_after,  min_bp_after,  max_bp_after , 5) as bp_after_bucket
    
from aggregates,blood_pressure

limit 10
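
The bucket arithmetic used above can be sketched in Python. This is a minimal re-implementation based on the documented behaviour of WIDTH_BUCKET, not Snowflake's own code, and the bounds 120 and 170 are made-up illustration values. Note one consequence for the query above: a value equal to the maximum falls outside the last bucket and gets bucket number 6.

```python
def width_bucket(value, min_value, max_value, num_buckets):
    """Return the 1-based equal-width bucket for value.

    Values below min_value get 0, values at or above max_value
    get num_buckets + 1.
    """
    if value < min_value:
        return 0
    if value >= max_value:
        return num_buckets + 1
    position = (value - min_value) * num_buckets / (max_value - min_value)
    # min() guards against floating-point rounding just below max_value
    return min(num_buckets, int(position) + 1)

# Hypothetical bounds: 5 buckets of width 10 between 120 and 170
values = [119, 120, 129, 130, 169, 170]
buckets = [width_bucket(v, 120, 170, 5) for v in values]
print(buckets)  # [0, 1, 1, 2, 5, 6]
```

If the maximum value should land in the last regular bucket, one option is to pass an upper bound slightly above the observed maximum.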

### 5.) Linear regression

Snowflake provides built-in aggregate functions for simple linear regression, which at the time of writing was **the only native machine learning algorithm** available directly in Snowflake SQL.

The three key regression functions are REGR_SLOPE, REGR_INTERCEPT and REGR_R2, which return the least-squares slope, the intercept and the corresponding r-squared respectively. The remaining REGR_ functions are useful helpers.

Have a look at this brilliant article by Simon Ward-Jones: https://www.simonwardjones.co.uk/posts/linear_regression_in_snowflake/
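
To make the three quantities concrete, here is a minimal Python sketch of the textbook least-squares formulas that the slope, intercept and r-squared correspond to. The function name `linear_regression` and the toy numbers are assumptions for illustration and are not course data. The sketch also skips edge cases such as a constant x column.

```python
def linear_regression(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Toy numbers, made up for this sketch
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, r_squared = linear_regression(xs, ys)
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))  # 0.6 2.2 0.6
```

In SQL the dependent variable comes first, so the equivalent calls would take the form REGR_SLOPE(y, x), REGR_INTERCEPT(y, x) and REGR_R2(y, x).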