## Feature Discovery with Instacart Dataset

**Author: João Gomes**

### Scope
This notebook provides an end-to-end example of how to use Feature Discovery with the DataRobot Python API.

### Background

Feature Discovery automatically aggregates data of different granularities and generates hundreds of features from it. This can significantly reduce the time data scientists and data engineers need to bring data together and start modeling. To learn more about Feature Discovery, see our community article [here](https://community.datarobot.com/t5/resources/feature-discovery-with-datarobot/ta-p/4972).
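To make the idea of aggregating across granularities concrete, here is a minimal hand-written sketch of a single feature of the kind Feature Discovery derives automatically: the number of orders each user placed in the 30 days before the prediction point. The rows are toy values invented for illustration, the helper `count_recent_orders` is hypothetical, and the exact window boundaries DataRobot applies may differ from the ones assumed here.

```python
from datetime import datetime, timedelta

# Toy rows invented for illustration; column names mirror this notebook.
primary = [
    {"user_id": 1, "time": datetime(2020, 3, 1)},
    {"user_id": 2, "time": datetime(2020, 3, 1)},
]
orders = [
    {"user_id": 1, "order_id": 10, "order_time": datetime(2020, 2, 20)},
    {"user_id": 1, "order_id": 11, "order_time": datetime(2020, 2, 27)},
    {"user_id": 1, "order_id": 12, "order_time": datetime(2019, 12, 1)},  # outside the window
    {"user_id": 2, "order_id": 13, "order_time": datetime(2020, 2, 15)},
]


def count_recent_orders(row, orders, window_days=30):
    """Count a user's orders in the window_days before the prediction point.

    Assumption: the window includes its start and excludes the prediction
    point itself, so no information from the future leaks into the feature.
    """
    start = row["time"] - timedelta(days=window_days)
    return sum(
        1
        for o in orders
        if o["user_id"] == row["user_id"] and start <= o["order_time"] < row["time"]
    )


# One row per primary record, enriched with the aggregated feature.
features = [
    {**row, "orders_last_30d": count_recent_orders(row, orders)}
    for row in primary
]

for f in features:
    print(f["user_id"], f["orders_last_30d"])
```

Feature Discovery produces many such aggregations (counts, sums, averages, and so on) across all related tables, which is what removes the need to write joins like this by hand.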

The dataset we will be using is a sampled version of the well-known Instacart dataset. More information on it can be found [here](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2).


### Requirements

- Python version 3.7.3
- DataRobot API version > 2.21.0

Full documentation of the Python package can be found here: https://datarobot-public-api-client.readthedocs-hosted.com

#### Import Libraries
To start with, let's import the libraries that will be used in this tutorial.

In [1]:
import datarobot as dr
import time

#### Connect to DataRobot

Connect to DataRobot using your API token and your endpoint. Change the inputs below accordingly.

In [3]:
dr.Client(token='YOUR_TOKEN',
          endpoint='YOUR_ENDPOINT')

<datarobot.rest.RESTClientObject at 0x7ffc655ddc18>
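If you would rather not paste credentials into the notebook, one option is to read them from environment variables. The sketch below assumes the variables are named `DATAROBOT_API_TOKEN` and `DATAROBOT_ENDPOINT`; the helper `client_kwargs_from_env` is hypothetical and only builds the keyword arguments, so it can be checked without contacting DataRobot.

```python
import os


def client_kwargs_from_env(env=os.environ):
    """Build keyword arguments for dr.Client from environment variables.

    The variable names are an assumption of this sketch; rename them to
    match however your environment stores the credentials.
    """
    token = env.get("DATAROBOT_API_TOKEN")
    endpoint = env.get("DATAROBOT_ENDPOINT")
    missing = [
        name
        for name, value in (
            ("DATAROBOT_API_TOKEN", token),
            ("DATAROBOT_ENDPOINT", endpoint),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"token": token, "endpoint": endpoint}


# Usage (requires the datarobot package and real credentials):
# import datarobot as dr
# dr.Client(**client_kwargs_from_env())
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the first API call.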

In [3]:
primary_dataset = dr.Dataset.create_from_file('data/train500.csv')
project = dr.Project.create_from_dataset(primary_dataset.id, project_name='Instacart FD API')

#### Create Secondary Datasets

In [4]:
orders_dataset = dr.Dataset.create_from_file(file_path='data/orders.csv')
order_products = dr.Dataset.create_from_file(file_path='data/order_products.csv')

#### Define dataset definitions and relationships
Adapt the identifiers, join keys, and time windows below to your own problem.

In [5]:
dataset_definitions = [
    {
        'identifier': 'orders',
        'catalogVersionId': orders_dataset.version_id,
        'catalogId': orders_dataset.id,
        'primaryTemporalKey': 'order_time',
        'snapshotPolicy': 'latest',
    },
    {
        'identifier': 'order_products',
        'catalogId': order_products.id,
        'catalogVersionId': order_products.version_id,
        'snapshotPolicy': 'latest',
    },
]

relationships = [
    {
        'dataset2Identifier': 'orders',
        'dataset1Keys': ['user_id'],
        'dataset2Keys': ['user_id'],
        'featureDerivationWindowStart': -30,
        'featureDerivationWindowEnd': 0,
        'featureDerivationWindowTimeUnit': 'DAY',
        'predictionPointRounding': 1,
        'predictionPointRoundingTimeUnit': 'DAY',
    },
    {
        'dataset1Identifier': 'orders',
        'dataset2Identifier': 'order_products',
        'dataset1Keys': ['order_id'],
        'dataset2Keys': ['order_id'],
    },
]

# Create the relationships configuration to define connection between the datasets
relationship_config = dr.RelationshipsConfiguration.create(dataset_definitions=dataset_definitions, relationships=relationships)
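A typo in an identifier or a mismatched key list is easy to make in these dictionaries. The following is a minimal sketch of a local sanity check you could run before calling `dr.RelationshipsConfiguration.create`. The function `validate_relationships` is hypothetical (it is not part of the DataRobot client) and assumes, as in the first relationship above, that a missing `dataset1Identifier` means the left side of the join is the primary dataset.

```python
def validate_relationships(dataset_definitions, relationships):
    """Return a list of human-readable problems; an empty list means none found."""
    known = {d["identifier"] for d in dataset_definitions}
    problems = []
    for i, rel in enumerate(relationships):
        # Every relationship must name the dataset being joined in.
        if "dataset2Identifier" not in rel:
            problems.append(f"relationship {i}: dataset2Identifier is missing")
        # Any identifier that is present must match a dataset definition.
        for field in ("dataset1Identifier", "dataset2Identifier"):
            name = rel.get(field)
            if name is not None and name not in known:
                problems.append(f"relationship {i}: unknown {field} '{name}'")
        # Join keys are paired by position, so both lists need the same length.
        if len(rel.get("dataset1Keys", [])) != len(rel.get("dataset2Keys", [])):
            problems.append(f"relationship {i}: key lists differ in length")
    return problems


# Trimmed copies of the notebook's configuration (catalog IDs omitted
# because they only exist after the datasets are uploaded).
example_definitions = [{"identifier": "orders"}, {"identifier": "order_products"}]
example_relationships = [
    {"dataset2Identifier": "orders",
     "dataset1Keys": ["user_id"], "dataset2Keys": ["user_id"]},
    {"dataset1Identifier": "orders", "dataset2Identifier": "order_products",
     "dataset1Keys": ["order_id"], "dataset2Keys": ["order_id"]},
]

print(validate_relationships(example_definitions, example_relationships))
```

This check only covers structural mistakes; DataRobot still validates column names and types against the actual datasets when the configuration is created.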

#### Prepare Datetime partitioning

In [2]:
partitioning_spec = dr.DatetimePartitioningSpecification('time')

#### Start Project

In [None]:
project.set_target(
    target='will_buy_bananas',
    relationships_configuration_id=relationship_config.id,
    partitioning_method=partitioning_spec,
)
project.wait_for_autopilot()

#### Get predictions from test set

In [7]:
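# Upload the test set so the recommended model can score it.
# Assumption: test.csv sits in the same data/ folder as the training files;
# the original cell relied on an undefined `path` variable.
path = 'data/'
dataset = project.upload_dataset(path + 'test.csv')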

model = dr.ModelRecommendation.get(
    project.id,
    dr.enums.RECOMMENDED_MODEL_TYPE.RECOMMENDED_FOR_DEPLOYMENT
).get_model()

pred_job = model.request_predictions(dataset.id)
preds = pred_job.get_result_when_complete()

In [8]:
preds.head()

| | prediction_threshold | prediction | row_id | positive_probability | class_0.0 | class_1.0 |
|---|---|---|---|---|---|---|
| 0 | 0.5 | 1.0 | 0 | 0.899787 | 0.100213 | 0.899787 |
| 1 | 0.5 | 1.0 | 1 | 0.96937 | 0.03063 | 0.96937 |
| 2 | 0.5 | 0.0 | 2 | 0.385222 | 0.614778 | 0.385222 |
| 3 | 0.5 | 0.0 | 3 | 0.15127 | 0.84873 | 0.15127 |
| 4 | 0.5 | 1.0 | 4 | 0.870262 | 0.129738 | 0.870262 |
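The `prediction` column in this output is consistent with comparing `positive_probability` against `prediction_threshold`. The sketch below reproduces the labels for the five rows shown and then applies a stricter cutoff. The helper `apply_threshold` is hypothetical, and treating a probability equal to the threshold as positive is an assumption of this sketch, not something the output confirms.

```python
# positive_probability values copied from the preds.head() output above.
positive_probability = [0.899787, 0.96937, 0.385222, 0.15127, 0.870262]


def apply_threshold(probabilities, threshold=0.5):
    """Label a row 1.0 when its probability reaches the threshold, else 0.0."""
    return [1.0 if p >= threshold else 0.0 for p in probabilities]


default_labels = apply_threshold(positive_probability)                # threshold 0.5
strict_labels = apply_threshold(positive_probability, threshold=0.9)  # fewer positives

print(default_labels)
print(strict_labels)
```

Raising the threshold trades recall for precision: fewer users are flagged as likely banana buyers, but those that are flagged have a higher predicted probability.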
