<a href="https://colab.research.google.com/github/allront/GoogleDev/blob/master/Custom_Python_Model_Main_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Custom Model For Concrete Strength

This sample notebook shows how to build a custom model for a specific dataset. For the demo in this workshop we use a Concrete Strength dataset that describes the properties of concrete.

If you'd like to write your own custom transformations, you can use this notebook with placeholders for you to fill in.
https://colab.research.google.com/drive/1eCx0Rhz6IxVR_QD0ikS4Npz6lO8sxzca

#### Extra libraries required in Colab

In [1]:
!pip install s3fs

Collecting s3fs
  Downloading s3fs-2022.3.0-py3-none-any.whl (26 kB)
Collecting aiohttp<=4
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting aiobotocore~=2.2.0
  Downloading aiobotocore-2.2.0.tar.gz (59 kB)
Collecting fsspec==2022.3.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (136 kB)
Collecting botocore<1.24.22,>=1.24.21
  Downloading botocore-1.24.21-py3-none-any.whl (8.6 MB)
Collecting aioitertools>=0.5.1
  Downloading aioitertools-0.10.0-py3-none-any.whl (23 kB)
Collecting asynctest==0.13.0
  Downloading asynctest-0.13.0-py3-none-any.whl (26 kB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.7.2-cp37-cp37m-manylinux_2_5_x86_64.m

#### Standard Libraries

In [2]:
import pandas as pd

### Loading and inspecting the data

We start by loading the dataset we are going to work with.

In [3]:
concrete_df = pd.read_csv('s3://abacusai.exampledatasets/predicting/concrete_measurements.csv')
concrete_df.describe()

| | cement | slag | flyash | water | superplasticizer | coarseaggregate | fineaggregate | age | csMPa |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


#### Custom Data Transform

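We transform the dataset so that flyash is no longer a feature. Instead, the rows are split into two groups, those with flyash == 0 and those with flyash > 0, and every remaining column except age is centered by subtracting its group mean. The target csMPa is centered as well, so the values predicted later in this notebook are offsets from the group mean rather than absolute strengths.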

The example is not entirely realistic, and the same result could certainly be achieved with SQL. The point is to illustrate that you are free to transform the dataset using the full functionality of Python and its data frameworks. Here we use pandas, but you can use a wide range of standard Python libraries to manipulate the data. You can also bundle resources with your code, for example small maps or tables, that your function can access to implement the transform.

In [4]:
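def transform_concrete(concrete_dataset):
  import pandas as pd
  # Drop flyash as a feature; it is only used to split the rows into two groups.
  feature_df = concrete_dataset.drop(['flyash'], axis=1)
  no_flyash = feature_df[concrete_dataset.flyash == 0.0]
  flyash = feature_df[concrete_dataset.flyash > 0.0]
  # Center every column on its group mean. assign(age=0) makes the age mean 0,
  # so age is left unchanged. Note that concat places all no-flyash rows first.
  return pd.concat([no_flyash - no_flyash.assign(age=0).mean(), flyash - flyash.assign(age=0).mean()])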

transformed_concrete_df = transform_concrete(concrete_df)
transformed_concrete_df.describe()

| | cement | slag | water | superplasticizer | coarseaggregate | fineaggregate | age | csMPa |
|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | -9.348811e-14 | 9.875854e-14 | 1.342995e-13 | -4.492631e-15 | -5.190963e-13 | 4.466899e-13 | 45.662136 | -2.355829e-15 |
| std | 97.99162 | 81.27146 | 20.61237 | 5.481593 | 77.75244 | 79.59387 | 63.169912 | 16.67246 |
| min | -212.0378 | -100.1102 | -60.01678 | -8.826078 | -172.3574 | -172.2265 | 1.0 | -34.44178 |
| 25% | -73.6722 | -41.91875 | -13.40776 | -4.055654 | -41.35742 | -42.84638 | 7.0 | -12.31678 |
| 50% | -4.037809 | -21.91875 | -0.3077586 | -1.826078 | -5.35742 | -1.853004 | 28.0 | -0.9444612 |
| 75% | 57.0028 | 66.68975 | 11.38322 | 2.748922 | 57.01595 | 57.297 | 56.0 | 10.40121 |
| max | 263.9278 | 259.2898 | 71.59224 | 28.14435 | 171.6426 | 227.747 | 365.0 | 45.82822 |
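To make the effect of the transform easier to see, here is a minimal, dependency-free sketch of the same group-wise centering on four made-up rows. The helper name `center_by_group`, its parameters, and the toy numbers are illustrative assumptions, not part of the notebook or of any library. The logic mirrors `transform_concrete`: drop flyash, split on flyash > 0, and subtract each group's mean from every column except age.

```python
from statistics import mean

def center_by_group(rows, group_key, drop="flyash", keep=("age",)):
    """Subtract each group's column mean from every column not listed in `keep`.

    `rows` is a list of dicts, `group_key` maps a row to its group label,
    and the `drop` column is removed from the output.
    """
    groups = {}
    for row in rows:
        groups.setdefault(group_key(row), []).append(row)

    centered = []
    for members in groups.values():
        columns = [c for c in members[0] if c != drop]
        # Columns in `keep` get a "mean" of 0, so they pass through unchanged.
        means = {c: 0.0 if c in keep else mean(r[c] for r in members) for c in columns}
        centered.extend({c: r[c] - means[c] for c in columns} for r in members)
    return centered

# Toy rows with made-up numbers (not taken from the dataset).
toy_rows = [
    {"cement": 300.0, "flyash": 0.0, "age": 7, "csMPa": 30.0},
    {"cement": 500.0, "flyash": 0.0, "age": 28, "csMPa": 50.0},
    {"cement": 200.0, "flyash": 100.0, "age": 3, "csMPa": 20.0},
    {"cement": 260.0, "flyash": 120.0, "age": 90, "csMPa": 40.0},
]

centered_rows = center_by_group(toy_rows, group_key=lambda r: r["flyash"] > 0)
for row in centered_rows:
    print(row)
```

Within each group the centered cement and csMPa values sum to zero, while age keeps its original values. This matches the summary statistics above, where every column except age has a mean of approximately zero.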


### Custom Model

#### Requires catboost

In [5]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.0.5-cp37-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.0.5


### Train Function

To illustrate that training can be customized arbitrarily, we train a composite model that, depending on the age of the concrete, uses either a linear model on quantile-transformed features or a gradient-boosted decision tree (GBDT) model trained on the raw inputs. It is effectively a decision tree on the age feature in which one branch delegates to a linear model and the other to a boosted tree model.

In [6]:
def train(training_dataset):
  # set the seed for reproducible results
  import numpy as np
  np.random.seed(5)

  X = training_dataset.drop(['csMPa'], axis=1)
  y = training_dataset.csMPa
  recent = training_dataset.age < 10
  from sklearn.preprocessing import QuantileTransformer
  from sklearn.linear_model import LinearRegression
  qt = QuantileTransformer(n_quantiles=20)
  recent_model = LinearRegression()
  _ = recent_model.fit(qt.fit_transform(X[recent]), y[recent])
  print(f'Linear model R^2 = {recent_model.score(qt.transform(X[recent]), y[recent])}')

  from catboost import Pool, CatBoostRegressor
  train_pool = Pool(X[~recent], y[~recent])
  older_model = CatBoostRegressor(iterations=5, depth=10, loss_function='RMSE')
  _ = older_model.fit(train_pool)
  metrics = older_model.eval_metrics(train_pool, ['RMSE'])
  old_r2 = 1 - metrics['RMSE'][-1]**2 / y[~recent].var()
  print(f'Catboost model R^2 = {old_r2}')

  return (X.columns, qt, recent_model, older_model)

local_model = train(transformed_concrete_df)

Linear model R^2 = 0.8387025158438364
Learning rate set to 0.5
0:	learn: 11.4988456	total: 109ms	remaining: 436ms
1:	learn: 8.8726781	total: 140ms	remaining: 211ms
2:	learn: 7.2674230	total: 192ms	remaining: 128ms
3:	learn: 6.1561252	total: 256ms	remaining: 64.1ms
4:	learn: 5.3489674	total: 309ms	remaining: 0us
Catboost model R^2 = 0.8751392374770929
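The CatBoost figure above is derived from the final training RMSE as 1 - RMSE² / Var(y). The following sketch works through that arithmetic on made-up numbers. The function name `r2_from_rmse` and the toy values are assumptions for illustration. It also shows one subtlety: pandas `Series.var()` defaults to the sample variance (ddof=1), whereas the textbook R² corresponds to the population variance, so the notebook's figure comes out marginally higher. With several hundred rows the gap is negligible.

```python
from statistics import pvariance, variance

def r2_from_rmse(rmse, targets, sample_variance=False):
    """Compute R^2 = 1 - RMSE^2 / Var(targets).

    sample_variance=True divides by n - 1 (the pandas default for .var());
    sample_variance=False divides by n (the textbook definition of R^2).
    """
    var = variance(targets) if sample_variance else pvariance(targets)
    return 1.0 - rmse ** 2 / var

# Made-up targets and predictions, for illustration only.
targets = [10.0, 20.0, 30.0, 40.0]
predictions = [12.0, 18.0, 33.0, 37.0]

mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)
rmse = mse ** 0.5

r2_textbook = r2_from_rmse(rmse, targets)
r2_sample = r2_from_rmse(rmse, targets, sample_variance=True)
print(f"textbook R^2 = {r2_textbook:.4f}, with sample variance = {r2_sample:.4f}")
```

Both R² values printed by the train function are computed on the training rows themselves, so they describe fit quality rather than performance on unseen data.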


### Prediction Function

In this example the model is a composite built on two partitions of the data, so the prediction function needs to dispatch each input to the right sub-model based on one of the input features, here the age.

In [7]:
def predict(model, query):
  columns, qt, recent_model, older_model = model
  import pandas as pd
  X = pd.DataFrame({c: [query[c]] for c in columns})
  if X.age[0] < 10:
    y = recent_model.predict(qt.transform(X))[0]
  else:
    y = older_model.predict(X.values.reshape(-1))
  return {'csMPa': y}

for _, r in transformed_concrete_df[transformed_concrete_df.age < 10][:5].iterrows():
  print(predict(local_model, r.to_dict()), r['csMPa'])

for _, r in transformed_concrete_df[transformed_concrete_df.age > 10][:5].iterrows():
  print(predict(local_model, r.to_dict()), r['csMPa'])

{'csMPa': -31.754124749801928} -28.711784452296826
{'csMPa': -5.324797742032466} 1.8282155477031736
{'csMPa': -4.377726654712582} -1.6917844522968295
{'csMPa': -23.147157848108044} -21.721784452296827
{'csMPa': -16.71201923334116} -10.511784452296826
{'csMPa': 25.433153968006927} 43.21821554770317
{'csMPa': 25.433153968006927} 25.118215547703173
{'csMPa': 6.059700620675216} 3.4982155477031753
{'csMPa': 6.059700620675216} 4.278215547703169
{'csMPa': 6.270561530680686} 7.528215547703169
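The dispatch logic can be isolated from the actual models. The sketch below uses two stand-in functions in place of the linear and CatBoost models. The names `make_composite`, `recent_stub`, and `older_stub` are hypothetical and exist only for this illustration. It shows the routing rule used in training and prediction: rows with age < 10 go to the first model and everything else, including age == 10, goes to the second.

```python
def make_composite(recent_model, older_model, age_threshold=10):
    """Return a predict function that routes a query by its age feature."""
    def predict(query):
        # Same split as in training: age < threshold is "recent".
        model = recent_model if query["age"] < age_threshold else older_model
        return {"csMPa": model(query)}
    return predict

# Stand-in models with arbitrary formulas, used only to show the routing.
def recent_stub(query):
    return 0.5 * query["cement"]

def older_stub(query):
    return query["cement"] + 2.0

composite = make_composite(recent_stub, older_stub)
print(composite({"age": 3, "cement": 20.0}))   # routed to recent_stub
print(composite({"age": 10, "cement": 20.0}))  # boundary case, routed to older_stub
```

Keeping the threshold in one place matters because training and prediction must agree on it. The demo loops above select `age < 10` and `age > 10`, so rows with age exactly 10 are not shown, although `predict` sends them to the boosted tree model.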




## Integrate with Abacus.AI

Many data science projects stop at this point. In fact, most do not even clearly define the prediction operation that production applications will use. Putting this model to work in production generally requires quite a bit more:
- Storing the model so that it is available in various production workflows
- Hosting the model in a scalable manner so that it can be used for online predictions
- Support for evaluating the model against large batches of new data
- Monitoring the model to ensure its inputs and predictions have not shifted significantly

Beyond these specific features for model usage, there is the significant task of reliably keeping the model up to date as new data arrives. This involves a workflow of operations that starts with refreshing the input datasets and ends with pushing the models to the serving infrastructure.

Real world machine learning applications require performing all these operations reliably.
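As a small illustration of the monitoring item in the list above, the sketch below flags a feature whose incoming mean has moved away from the training mean by more than a chosen number of training standard deviations. This is a deliberately simple heuristic and not how Abacus.AI implements monitoring. The function names, the 0.5 threshold, and the sample values are all assumptions made for this example.

```python
from statistics import mean, stdev

def mean_shift_in_std_units(reference, incoming):
    """Absolute shift of the incoming mean, measured in reference std units."""
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference sample has zero spread")
    return abs(mean(incoming) - mean(reference)) / ref_std

def has_drifted(reference, incoming, threshold=0.5):
    """Flag drift when the mean moves by more than `threshold` reference stds."""
    return mean_shift_in_std_units(reference, incoming) > threshold

# Made-up 'water' values: a training sample and two incoming batches.
training_water = [180.0, 185.0, 175.0, 190.0, 170.0]
similar_batch = [182.0, 178.0, 181.0, 179.0]
shifted_batch = [200.0, 205.0, 195.0, 210.0]

print(has_drifted(training_water, similar_batch))
print(has_drifted(training_water, shifted_batch))
```

A production system would typically track many features and the predictions themselves, and would use more robust statistics than a single mean comparison.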

### Use Abacus.AI for all this and more

- [Sign up](https://abacus.ai/app/signup?signupToken=python_models) for an Abacus.AI Account
- Once your account is created, navigate to the [API Keys Dashboard](https://abacus.ai/app/profile/apikey) and generate an API key to authenticate your ApiClient

# Abacus.AI Integration Notebook

https://colab.research.google.com/drive/1AVvPE5Ue89l5n8Ed9eqdjAV5NQHMEMyl
