This is one of the Objectiv [example notebooks](https://objectiv.io/docs/modeling/example-notebooks/). These notebooks can also run [on your own data](https://objectiv.io/docs/modeling/get-started-in-your-notebook/) (see [how to set up tracking](https://objectiv.io/docs/tracking/)).

This example notebook shows how you can use Objectiv to create a basic feature set and then use sklearn to do
machine learning directly on the raw data in your SQL database. We also have an example that goes deeper into
[feature engineering](https://objectiv.io/docs/modeling/example-notebooks/feature-engineering/).

## Get started
We first have to instantiate the model hub and an Objectiv DataFrame object.

In [None]:
# set the timeframe of the analysis
start_date = '2022-03-01'
end_date = None

In [None]:
from modelhub import ModelHub, display_sql_as_markdown
from sklearn import cluster

# instantiate the model hub and set the default time aggregation to daily
modelhub = ModelHub(time_aggregation='%Y-%m-%d')
# get a Bach DataFrame with Objectiv data within a defined timeframe
df = modelhub.get_objectiv_dataframe(start_date=start_date, end_date=end_date)

This object points to all data in the dataset, which is too large to load into pandas and therefore too large to use with sklearn. For
the dataset that we need, we will aggregate to user level, at which point it is small enough to fit in memory.

### Reference
* [modelhub.ModelHub](https://objectiv.io/docs/modeling/open-model-hub/api-reference/ModelHub/ModelHub/)
* [modelhub.ModelHub.get_objectiv_dataframe](https://objectiv.io/docs/modeling/open-model-hub/api-reference/ModelHub/get_objectiv_dataframe/)

## Create the dataset
We'll create a dataset of all the root locations that a user clicked on, per user.

In [None]:
df['root'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
# the root series is unstacked later, and its values might contain dashes,
# which are not allowed in BigQuery column names, so let's replace them
df['root'] = df['root'].str.replace('-', '_')

In [None]:
features = df[(df.event_type=='PressEvent')].groupby('user_id').root.value_counts()
features.head()

In [None]:
features_unstacked = features.unstack(fill_value=0)
# option 1: use the full feature set
kmeans_frame = features_unstacked
# option 2: use a 50% sample instead (this overwrites option 1; skip this line to keep the full set)
# for BigQuery the table name should be 'YOUR_PROJECT.YOUR_WRITABLE_DATASET.YOUR_TABLE_NAME'
kmeans_frame = features_unstacked.get_sample(table_name='kmeans_test', sample_percentage=50, overwrite=True)

Now we have a basic feature set that is small enough to fit in memory. This can be used with sklearn, as we demonstrate in this example.
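
To make the shape of this feature set concrete, here is a minimal sketch of the same three steps (filter on press events, count root locations per user, unstack) in plain pandas. The `events` table and its values are made up for illustration and are not Objectiv data; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy events (assumption: not real Objectiv data);
# the column names mirror the ones used in the cells above.
events = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u2', 'u3'],
    'event_type': ['PressEvent', 'PressEvent', 'VisibleEvent',
                   'PressEvent', 'PressEvent', 'PressEvent'],
    'root': ['home', 'docs', 'home', 'home', 'home', 'blog'],
})

# keep only press events, then count root locations per user
presses = events[events.event_type == 'PressEvent']
toy_features = presses.groupby('user_id').root.value_counts()

# one row per user, one column per root location, 0 where a user never pressed
toy_unstacked = toy_features.unstack(fill_value=0)
print(toy_unstacked)
```

Each row is a user and each column is a root location, which is the layout sklearn expects: one sample per row, one feature per column.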

### Reference
* [using global context data](open-taxonomy-how-to.ipynb#Location-stack-&-global-contexts)
* [modelhub.SeriesLocationStack.ls](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/ls/)
* [bach.DataFrame.groupby](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/groupby/)
* [bach.Series.value_counts](https://objectiv.io/docs/modeling/bach/api-reference/Series/value_counts/)
* [bach.DataFrame.head](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)
* [bach.DataFrame.unstack](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/unstack/)
* [bach.DataFrame.get_sample](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/get_sample/)

## Export to pandas for sklearn

In [None]:
# export to pandas now
pdf = kmeans_frame.to_pandas()
pdf

### Reference
* [bach.DataFrame.to_pandas](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/to_pandas/)

## Do basic kmeans clustering
Now that we have a pandas DataFrame with our dataset, we can run basic kmeans clustering on it.

In [None]:
# do basic kmeans
est = cluster.KMeans(n_clusters=3)
est.fit(pdf)
pdf['cluster'] = est.labels_
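
As a self-contained illustration of the same step, the sketch below runs kmeans on a small made-up feature table. The data and the `toy_` names are hypothetical. `random_state` and `n_init` are set here only to make the run reproducible; they are not part of the cell above.

```python
import pandas as pd
from sklearn import cluster

# Hypothetical per-user press counts (assumption: illustrative values only)
toy_pdf = pd.DataFrame(
    {
        'home': [9, 8, 0, 1, 0, 0],
        'docs': [0, 1, 9, 8, 0, 1],
        'blog': [0, 0, 1, 0, 9, 8],
    },
    index=pd.Index(['u1', 'u2', 'u3', 'u4', 'u5', 'u6'], name='user_id'),
)

# fixed seed and explicit n_init so repeated runs give the same grouping
toy_est = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
toy_est.fit(toy_pdf)

# labels_ has one entry per row, in the same order as the rows of toy_pdf
toy_pdf['cluster'] = toy_est.labels_
print(toy_pdf)
```

Note that once the `cluster` column has been added, fitting on the same frame again would treat the labels as a feature, so you may want to fit on the feature columns only when re-running.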

You can now use the created clusters on your entire dataset by adding them back to your Bach DataFrame. This is simple, as Bach and pandas work together nicely. After the merge below, your original Objectiv data has a 'cluster' column.

In [None]:
kmeans_frame['cluster'] = pdf['cluster']
kmeans_frame.sort_values('cluster').head()

In [None]:
df_with_cluster = df.merge(kmeans_frame[['cluster']], on='user_id')
df_with_cluster.head()
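
The merge above can be pictured with a small pandas sketch. The tables below are hypothetical, and the sketch shows pandas behavior only; whether Bach's `merge` defaults match in every detail is an assumption to check against the reference.

```python
import pandas as pd

# Hypothetical event-level data: several rows per user
toy_events = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'event_type': ['PressEvent', 'VisibleEvent', 'PressEvent', 'PressEvent'],
})

# Hypothetical user-level cluster labels, indexed by user_id;
# 'u3' is deliberately missing, as it would be for a user outside a sample
toy_clusters = pd.DataFrame(
    {'cluster': [0, 1]},
    index=pd.Index(['u1', 'u2'], name='user_id'),
)

# pandas defaults to an inner join: every event row gets its user's label,
# and rows for users without a label are dropped
toy_merged = toy_events.merge(toy_clusters[['cluster']], on='user_id')
print(toy_merged)
```

In this sketch the event for `u3` disappears from the result. If you clustered on a sample, it is worth checking how many users are left after the merge.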

You can use this column just like any other. For example, you can now use your created clusters to group models from the model hub:

In [None]:
modelhub.aggregate.session_duration(df_with_cluster, groupby='cluster').head()

### Reference
* [bach.DataFrame.sort_values](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/sort_values/)
* [bach.DataFrame.head](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)
* [bach.DataFrame.merge](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/merge/)
* [modelhub.Aggregate.session_duration](https://objectiv.io/docs/modeling/open-model-hub/models/aggregation/session_duration/)

## Get the SQL for any analysis
The SQL for any analysis can be exported with one command, so you can use models directly in production. This simplifies data debugging and delivery to tools like Metabase, dbt, etc. See how you can [quickly create BI dashboards with this](https://objectiv.io/docs/home/up#creating-bi-dashboards).

In [None]:
# show SQL for analysis; this is just one example, and works for any Objectiv model/analysis
display_sql_as_markdown(features)

That's it! [Join us on Slack](https://objectiv.io/join-slack) if you have any questions or suggestions.

## Next steps

### Use this notebook with your own data

You can use the example notebooks on any dataset that was collected with Objectiv's tracker, so feel free to 
use them to bootstrap your own projects. They are available as Jupyter notebooks on our [GitHub repository](https://github.com/objectiv/objectiv-analytics/tree/main/notebooks). See [instructions to set up the Objectiv tracker](https://objectiv.io/docs/tracking/).

### Check out related example notebooks

* [Feature engineering](https://objectiv.io/docs/modeling/example-notebooks/feature-engineering/) - see how [modeling library Bach](https://objectiv.io/docs/modeling/bach/) can be used for feature engineering.