## How-to guide for Cloud Spend use-case on Abacus.AI platform
This notebook provides a hands-on environment for building a cloud spend alert model using the Abacus.AI Python Client Library.

We'll be using the [Cloud Spend Dataset](https://s3.amazonaws.com//realityengines.exampledatasets/cloud_operations/cloud_spend.csv), which, as the name suggests, contains information about cloud spending.

1. Install the Abacus.AI library

In [1]:
!pip install abacusai

Collecting abacusai
  Downloading abacusai-0.32.9.tar.gz (40 kB)
Collecting fastavro
  Downloading fastavro-1.4.7-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
Collecting packaging
  Downloading packaging-21.3-py3-none-any.whl (40 kB)
Collecting pandas
  Downloading pandas-1.3.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Collecting pyparsing!=3.0.5,>=2.0.2
  Downloading pyparsing-3.0.6-py3-none-any.whl (97 kB)
Collecting numpy>=1.17.3; platform_machine != "aarch64" and platform_machine != "arm64" and python_version < "3.10"
  Downloading numpy-1.21.4-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)

We'll also import pandas and pprint tools for visualization in this notebook:

In [2]:
import pandas as pd # A tool we'll use to download and preview CSV files
import pprint # A tool to pretty print dictionary outputs
pp = pprint.PrettyPrinter(indent=2)

2. Add your Abacus.AI [API Key](https://abacus.ai/app/profile/apikey) generated using the API dashboard as follows:

In [3]:
#@title Abacus.AI API Key

api_key = '2fdecde877dc45fab937eff82b70eff0'  #@param {type: "string"}
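
A key typed directly into a shared notebook is visible to anyone who can read the file. As a minimal sketch of one alternative, the key can be read from an environment variable instead. The variable name `ABACUS_API_KEY` and the helper `load_api_key` are hypothetical choices made for this example; they are not defined by the Abacus.AI library.

```python
import os

def load_api_key(env=os.environ, var_name='ABACUS_API_KEY'):
    """Return the API key stored in `var_name`, or raise if it is missing.

    `ABACUS_API_KEY` is an assumed name used only for this sketch.
    """
    key = env.get(var_name, '').strip()
    if not key:
        raise RuntimeError(f'Set the {var_name} environment variable first')
    return key

# Usage in the notebook (commented out so the sketch runs without a key):
# api_key = load_api_key()
```

Passing `env` as a parameter keeps the helper easy to try out with a plain dictionary before relying on the real environment.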

3. Import the Abacus.AI library and instantiate a client.

In [4]:
from abacusai import ApiClient
client = ApiClient(api_key)

## 1. Create a Project

Abacus.AI projects are containers that hold datasets and trained models. By specifying a business **Use Case**, you let Abacus.AI tailor its deep learning algorithms to produce the best-performing model for your data.

We'll call the `list_use_cases` method to retrieve a list of the Use Cases currently available on the Abacus.AI platform.

In [5]:
client.list_use_cases()

[UseCase(use_case='UCPLUGANDPLAY',
   pretty_name='Plug & Play Your Tensorflow Model',
   description='Upload your already trained model and leverage our model serving infrastructure.. Host your models on our infrastructure and get a JSON api with auto scaling and more!'),
 UseCase(use_case='EMBEDDINGS_ONLY',
   pretty_name='Vector Matching Engine',
   description='Upload embeddings and leverage our similarity search infrastructure.. Scale to high traffic, update your index in near realtime'),
 UseCase(use_case='MODEL_WITH_EMBEDDINGS',
   pretty_name='Tensorflow Model With Vector Matching Engine',
   description='Upload your already trained model and leverage our model serving infrastructure.. Host your models on our infrastructure and get a JSON api with auto scaling and more!'),
 UseCase(use_case='TORCH_MODEL_WITH_EMBEDDINGS',
   pretty_name='PyTorch Model With Vector Matching Engine',
   description='Upload your already trained model and leverage our model serving infrastructure.. H

In this notebook, we're going to create a cloud spend alert model using the Cloud Spend dataset. The 'OPERATIONS_CLOUD' use case is best tailored for this situation.

In [6]:
#@title Abacus.AI Use Case

use_case = 'OPERATIONS_CLOUD'  #@param {type: "string"}

By calling the `describe_use_case_requirements` method, we can see which datasets this use case requires.

In [7]:
for requirement in client.describe_use_case_requirements(use_case):
  pp.pprint(requirement.to_dict())

{ 'allowed_feature_mappings': { 'DATE': { 'allowed_feature_types': [ 'TIMESTAMP'],
                                          'description': 'Date and time that '
                                                         'corresponds to the '
                                                         'value of the item.',
                                          'required': True},
                                'IGNORE': { 'description': 'Ignore this column '
                                                           'in training',
                                            'multiple': True,
                                            'required': False},
                                'SERVICE_ID': { 'allowed_feature_types': [ 'CATEGORICAL'],
                                                'description': 'The unique ID '
                                                               'of the service '
                                                               'that needs to '
      

Finally, let's create the project.

In [8]:
cloud_project = client.create_project(name='Cloud Spend Project', use_case=use_case)
cloud_project.to_dict()

{'project_id': '1388fabef0',
 'name': 'Cloud Spend Project',
 'use_case': 'OPERATIONS_CLOUD',
 'created_at': '2021-11-23T18:37:56+00:00',
 'feature_groups_enabled': True}

**Note: When `feature_groups_enabled` is True, the use case supports feature groups (collections of ML features). Feature groups are created at the organization level and can be attached to a project to train ML models.**

## 2. Add Datasets to your Project

Abacus.AI can read datasets directly from `AWS S3` or `Google Cloud Storage` buckets; alternatively, you can upload your datasets and store them with Abacus.AI. For this notebook, we will have Abacus.AI read the dataset directly from a public S3 bucket.

We are using one dataset for this notebook. We'll tell Abacus.AI how the dataset should be used when creating it by tagging the dataset with a special Abacus.AI **Dataset Type**.
- [Cloud Spend Dataset](https://s3.amazonaws.com//realityengines.exampledatasets/cloud_operations/cloud_spend.csv) (**TIMESERIES**): 
This dataset contains information about a company's cloud expenses, including the start date, service, and log cost.

### Add the dataset to Abacus.AI

First we'll use Pandas to preview the file, then add it to Abacus.AI.

In [9]:
pd.read_csv('https://s3.amazonaws.com//realityengines.exampledatasets/cloud_operations/cloud_spend.csv')

Unnamed: 0,UsageStartDate,service,LogCost
0,2019-01-28 23:00:00+00:00,AWSCloudTrail,0.285419
1,2019-01-28 23:00:00+00:00,AWSQueueService,-0.145920
2,2019-01-28 23:00:00+00:00,AmazonCloudFront,-0.430381
3,2019-01-28 23:00:00+00:00,AmazonCloudWatch,0.080604
4,2019-01-28 23:00:00+00:00,AmazonEC2,0.151666
...,...,...,...
46444,2019-08-31 23:00:00+00:00,AmazonEC2,0.226961
46445,2019-08-31 23:00:00+00:00,AmazonKinesis,-0.033456
46446,2019-08-31 23:00:00+00:00,AmazonRDS,0.249469
46447,2019-08-31 23:00:00+00:00,AmazonS3,-0.314570
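
Before handing the file to Abacus.AI, it can help to confirm that every timestamp parses and every cost is numeric. The following is a minimal sketch using only the standard library, run on a few rows copied from the preview above with the unnamed index column omitted. The helper name `check_rows` is hypothetical.

```python
import csv
import io
from datetime import datetime

# A few rows copied from the preview above (index column omitted).
SAMPLE = """UsageStartDate,service,LogCost
2019-01-28 23:00:00+00:00,AWSCloudTrail,0.285419
2019-01-28 23:00:00+00:00,AWSQueueService,-0.145920
2019-08-31 23:00:00+00:00,AmazonS3,-0.314570
"""

def check_rows(text):
    """Parse every row and return (row_count, set_of_services)."""
    services = set()
    count = 0
    for row in csv.DictReader(io.StringIO(text)):
        datetime.fromisoformat(row['UsageStartDate'])  # ValueError if malformed
        float(row['LogCost'])                          # ValueError if not numeric
        services.add(row['service'])
        count += 1
    return count, services

row_count, services = check_rows(SAMPLE)
print(row_count, sorted(services))
```

A row that fails either conversion raises `ValueError`, which points at the problem before any dataset is created.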


Using the Create Dataset API, we can tell Abacus.AI the public S3 URI where the dataset is located. We will also give the dataset a Refresh Schedule, which tells Abacus.AI when to refresh it (fetch the latest copy of the dataset).

If you're unfamiliar with Cron Syntax, Crontab Guru can help translate the syntax back into natural language: [https://crontab.guru/#0_12_\*_\*_\*](https://crontab.guru/#0_12_*_*_*)

**Note: This cron string will be evaluated in UTC time zone**
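
To make the schedule concrete, here is a small sketch that computes the next run time in UTC for a daily cron string such as `0 12 * * *`. It handles only the `minute hour * * *` form and is not a general cron parser; the helper name `next_daily_run` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def next_daily_run(cron, now):
    """Next UTC run time for a daily schedule of the form 'M H * * *'."""
    minute, hour, day, month, weekday = cron.split()
    if (day, month, weekday) != ('*', '*', '*'):
        raise ValueError('this sketch only handles daily schedules')
    candidate = now.replace(hour=int(hour), minute=int(minute),
                            second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate

now = datetime(2021, 11, 23, 18, 37, tzinfo=timezone.utc)
next_run = next_daily_run('0 12 * * *', now)
print(next_run.isoformat())  # 2021-11-24T12:00:00+00:00
```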

In [13]:
cloud_dataset = client.create_dataset_from_file_connector(
    name='Test Cloud Spend Data',
    table_name='Test_Cloud_Spend_Data',
    location='s3://realityengines.exampledatasets/cloud_operations/cloud_spend.csv',
    refresh_schedule='0 12 * * *')
datasets = [cloud_dataset]
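
Note that the pandas preview used an `https://s3.amazonaws.com/...` URL, while the dataset `location` uses the `s3://` form of the same object. As a sketch of how the two relate, the hypothetical helper below converts a path-style HTTPS URL into an `s3://` URI. It assumes the path-style layout `https://s3.amazonaws.com/<bucket>/<key>` and nothing else.

```python
from urllib.parse import urlparse

def https_to_s3_uri(url):
    """Convert https://s3.amazonaws.com/<bucket>/<key> to s3://<bucket>/<key>."""
    parsed = urlparse(url)
    if parsed.netloc != 's3.amazonaws.com':
        raise ValueError('expected a path-style s3.amazonaws.com URL')
    # Drop leading slashes, then split the bucket from the object key.
    bucket, _, key = parsed.path.lstrip('/').partition('/')
    return f's3://{bucket}/{key}'

uri = https_to_s3_uri(
    'https://s3.amazonaws.com//realityengines.exampledatasets/cloud_operations/cloud_spend.csv')
print(uri)  # s3://realityengines.exampledatasets/cloud_operations/cloud_spend.csv
```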

## 3. Create Feature Groups and add them to your Project

Datasets are created at the organization level and can be used to create feature groups as follows:

In [16]:
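# The SQL must reference the dataset's table_name ('Test_Cloud_Spend_Data') set when the dataset was created
feature_group = client.create_feature_group(table_name='test_cloud_spends_alert', sql='SELECT * FROM Test_Cloud_Spend_Data')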

Adding Feature Group to the project:

In [17]:
client.add_feature_group_to_project(feature_group_id=feature_group.feature_group_id, project_id=cloud_project.project_id)

Setting the Feature Group type according to the use case requirements:

In [18]:
client.set_feature_group_type(feature_group_id=feature_group.feature_group_id, project_id=cloud_project.project_id, feature_group_type='TIMESERIES')

Check current Feature Group schema:

In [19]:
client.get_feature_group_schema(feature_group_id=feature_group.feature_group_id)

[Feature(name='UsageStartDate',
   select_clause=None,
   feature_mapping=None,
   source_table='Test_Cloud_Spend_Data',
   original_name=None,
   using_clause=None,
   order_clause=None,
   where_clause=None,
   feature_type='TIMESTAMP',
   data_type='DATETIME',
   columns=None,
   point_in_time_info=None),
 Feature(name='service',
   select_clause=None,
   feature_mapping=None,
   source_table='Test_Cloud_Spend_Data',
   original_name=None,
   using_clause=None,
   order_clause=None,
   where_clause=None,
   feature_type='CATEGORICAL',
   data_type='STRING',
   columns=None,
   point_in_time_info=None),
 Feature(name='LogCost',
   select_clause=None,
   feature_mapping=None,
   source_table='Test_Cloud_Spend_Data',
   original_name=None,
   using_clause=None,
   order_clause=None,
   where_clause=None,
   feature_type='NUMERICAL',
   data_type='FLOAT',
   columns=None,
   point_in_time_info=None)]

For each **Use Case**, there are special **Column Mappings** that must be applied to columns to fulfill the use case requirements. We can find the list of available **Column Mappings** by calling the *Describe Use Case Requirements* API:

In [21]:
client.describe_use_case_requirements(use_case)[0].allowed_feature_mappings

{'SERVICE_ID': {'description': 'The unique ID of the service that needs to be monitored for possible anomalies.',
  'allowed_feature_types': ['CATEGORICAL'],
  'required': True},
 'VALUE': {'description': 'The target value of the item being monitored.',
  'allowed_feature_types': ['NUMERICAL'],
  'required': True},
 'DATE': {'description': 'Date and time that corresponds to the value of the item.',
  'allowed_feature_types': ['TIMESTAMP'],
  'required': True},
 'IGNORE': {'description': 'Ignore this column in training',
  'multiple': True,
  'required': False}}
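
The output above can be filtered down to the mappings that must be assigned before training. This is a minimal sketch that works on a plain dictionary copied from that output (descriptions omitted); the helper name `required_mappings` is hypothetical.

```python
# Copied from the output above, with descriptions omitted.
allowed_feature_mappings = {
    'SERVICE_ID': {'allowed_feature_types': ['CATEGORICAL'], 'required': True},
    'VALUE': {'allowed_feature_types': ['NUMERICAL'], 'required': True},
    'DATE': {'allowed_feature_types': ['TIMESTAMP'], 'required': True},
    'IGNORE': {'multiple': True, 'required': False},
}

def required_mappings(allowed):
    """Return the names of all mappings flagged as required, sorted."""
    return sorted(name for name, spec in allowed.items() if spec.get('required'))

print(required_mappings(allowed_feature_mappings))  # ['DATE', 'SERVICE_ID', 'VALUE']
```

Each of these three names needs a column of the listed feature type, which is what the next cell assigns.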

In [22]:
client.set_feature_mapping(project_id = cloud_project.project_id,feature_group_id= feature_group.feature_group_id, feature_name='UsageStartDate',feature_mapping='DATE')
client.set_feature_mapping(project_id = cloud_project.project_id,feature_group_id= feature_group.feature_group_id, feature_name='service',feature_mapping='SERVICE_ID')
client.set_feature_mapping(project_id = cloud_project.project_id,feature_group_id= feature_group.feature_group_id, feature_name='LogCost',feature_mapping='VALUE')

[Feature(name='UsageStartDate',
   select_clause=None,
   feature_mapping='DATE',
   source_table='Test_Cloud_Spend_Data',
   original_name=None,
   using_clause=None,
   order_clause=None,
   where_clause=None,
   feature_type='TIMESTAMP',
   data_type='DATETIME',
   columns=None,
   point_in_time_info=None),
 Feature(name='service',
   select_clause=None,
   feature_mapping='SERVICE_ID',
   source_table='Test_Cloud_Spend_Data',
   original_name=None,
   using_clause=None,
   order_clause=None,
   where_clause=None,
   feature_type='CATEGORICAL',
   data_type='STRING',
   columns=None,
   point_in_time_info=None),
 Feature(name='LogCost',
   select_clause=None,
   feature_mapping='VALUE',
   source_table='Test_Cloud_Spend_Data',
   original_name=None,
   using_clause=None,
   order_clause=None,
   where_clause=None,
   feature_type='NUMERICAL',
   data_type='FLOAT',
   columns=None,
   point_in_time_info=None)]
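
The three `set_feature_mapping` calls differ only in the feature name and the mapping, so they can also be driven from a dictionary. The sketch below uses only the keyword arguments shown in the cell above; the helper `apply_feature_mappings` and the constant `FEATURE_MAPPINGS` are hypothetical names introduced for illustration.

```python
FEATURE_MAPPINGS = {
    'UsageStartDate': 'DATE',
    'service': 'SERVICE_ID',
    'LogCost': 'VALUE',
}

def apply_feature_mappings(client, project_id, feature_group_id, mappings):
    """Call client.set_feature_mapping once per column; return the last result."""
    result = None
    for feature_name, feature_mapping in mappings.items():
        result = client.set_feature_mapping(
            project_id=project_id,
            feature_group_id=feature_group_id,
            feature_name=feature_name,
            feature_mapping=feature_mapping,
        )
    return result

# Usage in the notebook (needs the client and objects created earlier):
# apply_feature_mappings(client, cloud_project.project_id,
#                        feature_group.feature_group_id, FEATURE_MAPPINGS)
```

Keeping the column-to-mapping pairs in one place makes it easier to adapt the notebook to a dataset with different column names.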

For each required Feature Group Type within the use case, you must assign the Feature Group to be used for training the model:

In [23]:
client.use_feature_group_for_training(project_id=cloud_project.project_id, feature_group_id=feature_group.feature_group_id)

Now that our feature group is assigned, we're almost ready to train a model!

To be sure that our project is ready to go, let's call `cloud_project.validate()` to confirm that all the project requirements have been met:

In [24]:
cloud_project.validate()

ProjectValidation(valid=True,
  dataset_errors=[],
  column_hints={})
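
In a script, it can be safer to stop as soon as validation fails rather than discover the problem at training time. The sketch below relies only on the `valid`, `dataset_errors`, and `column_hints` fields shown in the output above. The helper `assert_project_valid` is hypothetical, and a stand-in object is used so the example runs without a project.

```python
from types import SimpleNamespace

def assert_project_valid(validation):
    """Raise with the reported problems if the project failed validation."""
    if not validation.valid:
        raise ValueError(
            f'Project is not ready: dataset_errors={validation.dataset_errors}, '
            f'column_hints={validation.column_hints}'
        )
    return validation

# Stand-in with the same fields as the ProjectValidation output above.
ok = SimpleNamespace(valid=True, dataset_errors=[], column_hints={})
assert_project_valid(ok)

# Usage in the notebook:
# assert_project_valid(cloud_project.validate())
```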

## 4. Train a Model

For each **Use Case**, Abacus.AI offers several training options. We can call the *Get Training Config Options* API to see the available options.

In [25]:
cloud_project.get_training_config_options()

[TrainingConfigOptions(name='TEST_SPLIT',
   data_type='INTEGER',
   value_type=None,
   value_options=None,
   value=None,
   default=10,
   options={'range': [5, 20]},
   description='Percent of dataset to use for test data. We support using a range between 5% to 20% of your dataset to use as test data.',
   required=None,
   last_model_value=None),
 TrainingConfigOptions(name='DROPOUT_RATE',
   data_type='INTEGER',
   value_type=None,
   value_options=None,
   value=None,
   default=None,
   options={'range': [0, 90]},
   description='Dropout percentage rate.',
   required=None,
   last_model_value=None),
 TrainingConfigOptions(name='BATCH_SIZE',
   data_type='ENUM',
   value_type=None,
   value_options=None,
   value=None,
   default=None,
   options={'values': [8, 16, 32, 64, 128, 256, 384, 512, 740, 1024]},
   description='Batch size.',
   required=None,
   last_model_value=None)]
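
When experimenting with non-default values, a quick local check against the ranges listed above can catch mistakes before a training job is submitted. This is a sketch that works on a dictionary copied from that output; the helper `check_training_config` is hypothetical, and whether the platform applies further constraints is not covered here.

```python
# Ranges and allowed values copied from the output above.
OPTIONS = {
    'TEST_SPLIT': {'range': [5, 20]},
    'DROPOUT_RATE': {'range': [0, 90]},
    'BATCH_SIZE': {'values': [8, 16, 32, 64, 128, 256, 384, 512, 740, 1024]},
}

def check_training_config(config, options=OPTIONS):
    """Return a list of problems; an empty list means the config passes these checks."""
    problems = []
    for name, value in config.items():
        spec = options.get(name)
        if spec is None:
            problems.append(f'{name}: unknown option')
        elif 'range' in spec and not spec['range'][0] <= value <= spec['range'][1]:
            problems.append(f'{name}: {value} is outside {spec["range"]}')
        elif 'values' in spec and value not in spec['values']:
            problems.append(f'{name}: {value} is not one of {spec["values"]}')
    return problems

candidate = {'TEST_SPLIT': 15, 'DROPOUT_RATE': 10, 'BATCH_SIZE': 128}
print(check_training_config(candidate))  # []
```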

In this notebook, we'll just train with the default options, but definitely feel free to experiment, especially if you have familiarity with Machine Learning.

In [26]:
cloud_model = cloud_project.train_model(training_config={})
cloud_model.to_dict()

{'name': 'Cloud Spend Project Model',
 'model_id': '1baa847ea',
 'model_config': {},
 'created_at': '2021-11-23T18:42:56+00:00',
 'project_id': '1388fabef0',
 'shared': False,
 'shared_at': None,
 'train_function_name': None,
 'predict_function_name': None,
 'training_input_tables': None,
 'source_code': None,
 'location': None,
 'refresh_schedules': None,
 'latest_model_version': {'model_version': 'd93e89b44',
  'status': 'PENDING',
  'model_id': '1baa847ea',
  'model_config': {},
  'training_started_at': None,
  'training_completed_at': None,
  'dataset_versions': None,
  'error': None,
  'pending_deployment_ids': None,
  'failed_deployment_ids': None}}

After training starts, we can call `wait_for_evaluation`, a blocking call that periodically checks the model's status and returns once the model has been trained and evaluated.

In [27]:
cloud_model.wait_for_evaluation()

Model(name='Cloud Spend Project Model',
  model_id='1baa847ea',
  model_config={},
  created_at='2021-11-23T18:42:56+00:00',
  project_id='1388fabef0',
  shared=False,
  shared_at=None,
  train_function_name=None,
  predict_function_name=None,
  training_input_tables=None,
  source_code=None,
  location=None,
  refresh_schedules=None,
  latest_model_version=ModelVersion(model_version='d93e89b44',
  status='COMPLETE',
  model_id='1baa847ea',
  model_config={},
  training_started_at='2021-11-23T18:44:41+00:00',
  training_completed_at='2021-11-23T19:04:17+00:00',
  dataset_versions=['45f68d7de'],
  error=None,
  pending_deployment_ids=[],
  failed_deployment_ids=[]))
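
Conceptually, a blocking wait of this kind is a polling loop. The sketch below illustrates the general pattern with a timeout; it is not the library's implementation. The helper `wait_until` is hypothetical, `PENDING` and `COMPLETE` appear in the outputs above, and the `TRAINING` and `FAILED` state names are assumptions made for the demo.

```python
import time

def wait_until(get_status, done_states=('COMPLETE', 'FAILED'),
               interval=30.0, timeout=3600.0):
    """Poll get_status() until it returns a state in done_states or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f'still {status} after {timeout} seconds')
        time.sleep(interval)

# Demo with a fake status source instead of a real model.
_statuses = iter(['PENDING', 'TRAINING', 'COMPLETE'])
print(wait_until(lambda: next(_statuses), interval=0))  # COMPLETE
```

A timeout keeps a stalled job from blocking a script indefinitely, which matters more in automation than in an interactive notebook.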

**Note that model training can take anywhere from a few minutes to a few hours, depending on the size of the datasets, the complexity of the models being trained, and a variety of other factors.**

## **Checkpoint** [Optional]
Model training can take hours to complete, so your page could time out or you might accidentally hit the refresh button. This section helps you restore your progress:

In [None]:
!pip install abacusai
import pandas as pd
import pprint
pp = pprint.PrettyPrinter(indent=2)
api_key = ''  #@param {type: "string"}
from abacusai import ApiClient
client = ApiClient(api_key)
cloud_project = next(project for project in client.list_projects() if project.name == 'Cloud Spend Project')
cloud_model = cloud_project.list_models()[-1]
cloud_model.wait_for_evaluation()
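
The checkpoint cell picks the last entry of `list_models()`, which depends on the order in which models are returned. If that order is uncertain, one option is to select by the `created_at` timestamp shown in the model output earlier. The sketch below uses stand-in objects so it runs on its own; the helper `latest_by_created_at` is hypothetical.

```python
from datetime import datetime
from types import SimpleNamespace

def latest_by_created_at(items):
    """Return the item with the most recent ISO 8601 `created_at` timestamp."""
    return max(items, key=lambda item: datetime.fromisoformat(item.created_at))

# Stand-in objects; in the notebook these would come from cloud_project.list_models().
models = [
    SimpleNamespace(model_id='newer', created_at='2021-11-23T18:42:56+00:00'),
    SimpleNamespace(model_id='older', created_at='2021-11-22T09:00:00+00:00'),
]
print(latest_by_created_at(models).model_id)  # newer
```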

## Evaluate your Model Metrics

After your model is done training you can inspect the model's quality by reviewing the model's metrics:

In [None]:
pp.pprint(cloud_model.get_metrics().to_dict())

To get a better understanding of what these metrics mean, visit our [documentation](https://abacus.ai/app/help/useCases/OPERATIONS_CLOUD/training) page.

## 5. Deploy Model

After the model has been trained, we need to deploy it before we can start making predictions. Deploying a model reserves cloud resources to host the model for real-time and/or batch predictions.

In [None]:
cloud_deployment = client.create_deployment(name='Cloud Spend Deployment', model_id=cloud_model.model_id,description='Cloud Spend Deployment')
cloud_deployment.wait_for_deployment()

After the model is deployed, we need to create a deployment token for authenticating prediction requests. This token is only authorized to predict on deployments in this project, so it's safe to embed this token inside of a user-facing application or website.

In [None]:
deployment_token = cloud_project.create_deployment_token().deployment_token
deployment_token

## 6. Predict


Now that you have an active deployment and a deployment token to authenticate requests, you can make the `get_anomalies` API call below.

This call returns information about a company's cloud spending, including the threshold, service, and anomalies. Anomaly detection is performed on the data in the Cloud Spend dataset.

In [None]:
# The deployment token authenticates this request, so the client is created without an API key.
ApiClient().get_anomalies(deployment_token=deployment_token,
                          deployment_id=cloud_deployment.deployment_id)