This is one of the Objectiv [example notebooks](https://objectiv.io/docs/modeling/example-notebooks/). These notebooks can also run [on your own data](https://objectiv.io/docs/modeling/get-started-in-your-notebook/) (see [how to set up tracking](https://objectiv.io/docs/tracking/)).

# Logistic Regression
Data collected with Objectiv is [strictly structured & designed for modeling](https://objectiv.io/docs/taxonomy), making it ideal for various machine learning models, which can be applied directly without cleaning, transformations, or complex tooling.

This example notebook shows how you can predict user behavior with the [Logistic Regression model in the open model hub](https://objectiv.io/docs/modeling/open-model-hub/models/machine-learning/LogisticRegression/LogisticRegression/) on a full dataset collected with Objectiv. Examples of predictions you can create:

- Will a user convert?
- Will a user start using a specific product feature or area?
- Will a user have a long active session duration?

## Get started
We first have to instantiate the model hub and an Objectiv DataFrame object.

In [None]:
# set the timeframe of the analysis
start_date = '2022-03-01'
end_date = None

In [None]:
from modelhub import ModelHub, display_sql_as_markdown
from datetime import datetime

# instantiate the model hub and set the default time aggregation to daily
modelhub = ModelHub(time_aggregation='%Y-%m-%d')
# get a Bach DataFrame with Objectiv data within a defined timeframe
df = modelhub.get_objectiv_dataframe(start_date=start_date, end_date=end_date)

The `location_stack` column, and the columns taken from the global contexts, contain most of the event-specific data. These columns are JSON typed, and we can extract data from them using the keys of the JSON objects with [`SeriesLocationStack`](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/SeriesLocationStack/) methods, or the `context` accessor for global context columns. See the [open taxonomy example](open-taxonomy-how-to.ipynb#Location-stack-&-global-contexts) for how to use the `location_stack` and global contexts.

In [None]:
df['root_location'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
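
To make the extraction above concrete, here is a minimal pure-Python sketch of the same lookup for a single event. It assumes a location stack is a JSON array of context objects that each carry a `_type` and an `id` key; the helper `get_from_context_with_type`, the choice to return the first match, and the sample stack are all hypothetical and only mimic what the Bach call does in the database.

```python
import json
from typing import Any, Dict, List, Optional


def get_from_context_with_type(location_stack: List[Dict[str, Any]],
                               context_type: str, key: str) -> Optional[Any]:
    """Return `key` from the first context whose `_type` equals `context_type`."""
    for context in location_stack:
        if context.get('_type') == context_type:
            return context.get(key)
    return None  # no context of that type in the stack


# hypothetical location stack of a single event (assumed JSON layout)
raw = '''[
  {"_type": "RootLocationContext", "id": "modeling"},
  {"_type": "NavigationContext", "id": "docs-sidebar"},
  {"_type": "LinkContext", "id": "logistic-regression"}
]'''
location_stack = json.loads(raw)

root_location = get_from_context_with_type(location_stack, 'RootLocationContext', 'id')
print(root_location)
```

In the notebook this lookup runs for every row in SQL, so no event data has to be loaded into Python.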

### Reference
* [modelhub.ModelHub](https://objectiv.io/docs/modeling/open-model-hub/api-reference/ModelHub/ModelHub/)
* [modelhub.ModelHub.get_objectiv_dataframe](https://objectiv.io/docs/modeling/open-model-hub/api-reference/ModelHub/get_objectiv_dataframe/)
* [Using global context data](open-taxonomy-how-to.ipynb#Location-stack-&-global-contexts)
* [modelhub.SeriesLocationStack.ls](https://objectiv.io/docs/modeling/open-model-hub/api-reference/SeriesLocationStack/ls/)

## Creating a feature set to predict user behavior

For simple demonstration purposes, we'll predict if users on our own [website](https://www.objectiv.io) will reach the [modeling section of our docs](https://objectiv.io/docs/modeling/), by looking at interactions they have with all the main sections on our site, as defined by the [root location](https://objectiv.io/docs/taxonomy/reference/location-contexts/RootLocationContext/).

We'll create a dataset that counts the number of clicks per user in each section. Note that this is a simple dataset, used only to demonstrate the logistic regression functionality rather than to produce meaningful results. For the ins and outs of feature engineering, see the [feature engineering notebook](https://objectiv.io/docs/modeling/example-notebooks/feature-engineering/).

In [None]:
# first replace dashes in the root_location Series, because it is unstacked later on
# and dashes are not allowed in BigQuery column names
df['root_location'] = df['root_location'].str.replace('-', '_')

In [None]:
# look at the number of clicks per user in each section; only PressEvents, counting the root_locations
features = df[(df.event_type=='PressEvent')].groupby('user_id').root_location.value_counts()

In [None]:
# unstack the series, to create a DataFrame with the number of clicks per root location as columns
features_unstacked = features.unstack(fill_value=0)
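
As a rough illustration of what the `groupby`, `value_counts` and `unstack` steps above produce, the following pure-Python sketch builds the same kind of table from a handful of made-up events. The event tuples and section names are hypothetical, and the dictionary layout is only a stand-in for the Bach DataFrame.

```python
from collections import Counter, defaultdict

# hypothetical events: (user_id, event_type, root_location)
events = [
    ('u1', 'PressEvent', 'home'),
    ('u1', 'PressEvent', 'modeling'),
    ('u1', 'VisibleEvent', 'home'),   # not a click, so it is ignored
    ('u2', 'PressEvent', 'home'),
    ('u2', 'PressEvent', 'home'),
    ('u2', 'PressEvent', 'join-slack'),
]

# count clicks per user and root location, replacing dashes as in the notebook
counts = defaultdict(Counter)
for user_id, event_type, root_location in events:
    if event_type == 'PressEvent':
        counts[user_id][root_location.replace('-', '_')] += 1

# "unstack": one column per root location, missing combinations filled with 0
columns = sorted({loc for per_user in counts.values() for loc in per_user})
features_unstacked = {
    user_id: {col: per_user.get(col, 0) for col in columns}
    for user_id, per_user in counts.items()
}

for user_id, row in features_unstacked.items():
    print(user_id, row)
```

Every user ends up with a value for every section, which is what `fill_value=0` guarantees in the notebook.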

### Reference
* [bach.DataFrame.groupby](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/groupby/)
* [bach.Series.value_counts](https://objectiv.io/docs/modeling/bach/api-reference/Series/value_counts/)
* [bach.DataFrame.unstack](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/unstack/)

### Sample the data
To limit data processing and speed up fitting, let's take a 10% sample of the full dataset to train the model on. After the model is fitted, the sampled dataset can easily be unsampled again, so the labels can be predicted for the _entire_ dataset.

In [None]:
# take a 10% sample to train the model on
# for BigQuery the table name should be 'YOUR_PROJECT.YOUR_WRITABLE_DATASET.YOUR_TABLE_NAME'
features_set_sample = features_unstacked.get_sample('test_lr_sample', sample_percentage=10, overwrite=True)
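
The idea behind sampling can be sketched in plain Python as follows. This is only an illustration under the assumption that each row is kept independently with a 10% probability; how `get_sample` selects rows in the database may differ, and the rows and seed below are hypothetical.

```python
import random

# hypothetical feature rows keyed by user id
rows = {f'user_{i}': {'home': i % 5, 'modeling': i % 2} for i in range(1000)}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
sample_percentage = 10

# keep each row with a probability of sample_percentage percent
sample = {
    user_id: row
    for user_id, row in rows.items()
    if rng.random() * 100 < sample_percentage
}

print(len(sample), 'of', len(rows), 'rows sampled')
```

The full set of rows stays untouched, which is why going back to the unsampled data later is cheap.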

To predict whether a user clicked in the modeling section of our docs, we will look at the number of clicks in any of the other sections:
- `X` is a [DataFrame](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/) that contains the explanatory variables.
- `y` is a [SeriesBoolean](https://objectiv.io/docs/modeling/bach/api-reference/Series/Boolean/) with the labels we want to predict.

In [None]:
# set the explanatory variables and labels to predict
y_column = 'modeling'
y = features_set_sample[y_column] > 0
X = features_set_sample.drop(columns=[y_column])

In [None]:
# see what `X` looks like
X.head()

In [None]:
# and see what `y` looks like
y.head()
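
The split into `X` and `y` can be pictured with a small pure-Python sketch. The three users and their click counts are made up, and dictionaries stand in for the Bach objects.

```python
# hypothetical unstacked features: clicks per section for each user
features = {
    'u1': {'about': 0, 'home': 1, 'modeling': 1},
    'u2': {'about': 1, 'home': 2, 'modeling': 0},
    'u3': {'about': 0, 'home': 0, 'modeling': 3},
}

y_column = 'modeling'

# label: did the user click at least once in the modeling section?
y = {user_id: row[y_column] > 0 for user_id, row in features.items()}

# explanatory variables: every column except the one the label is derived from
X = {
    user_id: {col: value for col, value in row.items() if col != y_column}
    for user_id, row in features.items()
}

print(y)
print(X)
```

Dropping the `modeling` column from `X` matters: if it stayed in, the model could read the label directly from its own input.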

### Reference
* [bach.DataFrame.get_sample](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/get_sample/)
* [bach.DataFrame.head](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)

## Instantiate & fit the logistic regression model
As the model is based on sklearn's version of LogisticRegression, it can be instantiated with any parameters that [sklearn's LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) supports. In our example we instantiate it with ``fit_intercept=False``.

In [None]:
lr = modelhub.get_logistic_regression(fit_intercept=False)

The `fit` operation then fits it to the passed data. This operation extracts the data from the database under the hood.

In [None]:
lr.fit(X, y)
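
To give a feel for what fitting without an intercept means, here is a minimal sketch that fits the weights with plain gradient descent. This is a swapped-in technique for illustration only: it is not the solver sklearn uses by default, it applies no regularization, and the data, learning rate and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_no_intercept(X, y, learning_rate=0.1, epochs=500):
    """Fit logistic regression weights (no intercept term) by gradient descent."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    for _ in range(epochs):
        gradient = [0.0] * n_features
        for row, label in zip(X, y):
            p = sigmoid(sum(w * x for w, x in zip(weights, row)))
            for j, x in enumerate(row):
                gradient[j] += (p - float(label)) * x
        # step against the average gradient of the log loss
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, gradient)]
    return weights


# hypothetical click counts in two sections: [about, home]
X = [[3, 0], [2, 1], [0, 2], [0, 3]]
y = [True, True, False, False]

weights = fit_no_intercept(X, y)
print(weights)
```

With `fit_intercept=False` the model has no bias term, so the prediction depends only on the weighted click counts.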

### Reference
* [modelhub.ModelHub.get_logistic_regression](https://objectiv.io/docs/modeling/open-model-hub/api-reference/ModelHub/get_logistic_regression/)
* [modelhub.LogisticRegression.fit](https://objectiv.io/docs/modeling/open-model-hub/models/machine-learning/LogisticRegression/fit/)

## Accuracy & prediction
All of the following operations are carried out directly on the database.

In [None]:
# see the score
lr.score(X, y)
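
Assuming `score` follows sklearn's convention for classifiers and returns the mean accuracy, the number can be read as the share of users whose label was predicted correctly. A small sketch with made-up labels:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# hypothetical true and predicted labels for five users
y_true = [True, False, True, True, False]
y_pred = [True, False, False, True, True]

print(accuracy(y_true, y_pred))  # 3 of 5 correct, so 0.6
```

Accuracy alone can be misleading when one label is much more common than the other, so it is best read next to the share of `True` labels in `y`.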

The model provides the same attributes as [sklearn's Logistic Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), such as `coef_`. 

In [None]:
# see the coefficients of the fitted model
lr.coef_

Now let's create columns for the predicted values and the labels in the dataset. Labels are set to `True` if the probability is over 0.5.

In [None]:
# create columns for predicted values and labels
features_set_sample['predicted_values'] = lr.predict_proba(X)
features_set_sample['predicted_labels'] = lr.predict(X)
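
The relation between the predicted values and the predicted labels can be sketched as follows, for a model without an intercept. The coefficients and click counts are hypothetical, and the local `predict_proba` function is an illustration, not the model hub implementation.

```python
import math


def predict_proba(weights, row):
    """Probability of the True label: logistic function of the weighted sum."""
    z = sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


weights = [0.8, -0.5]            # hypothetical coefficients for [about, home]
rows = [[3, 0], [0, 2], [0, 0]]  # hypothetical click counts for three users

predicted_values = [predict_proba(weights, row) for row in rows]
# a label is True only if the probability is over 0.5
predicted_labels = [p > 0.5 for p in predicted_values]

for row, p, label in zip(rows, predicted_values, predicted_labels):
    print(row, round(p, 3), label)
```

Note that without an intercept a user with zero clicks in every section always gets a probability of exactly 0.5, and therefore the label `False`.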

In [None]:
# see the sampled data set, including predictions
features_set_sample.head(20)

### Reference
* [modelhub.LogisticRegression.score](https://objectiv.io/docs/modeling/open-model-hub/models/machine-learning/LogisticRegression/score/)
* [sklearn's Logistic Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [modelhub.LogisticRegression.predict_proba](https://objectiv.io/docs/modeling/open-model-hub/models/machine-learning/LogisticRegression/predict_proba/)
* [modelhub.LogisticRegression.predict](https://objectiv.io/docs/modeling/open-model-hub/models/machine-learning/LogisticRegression/predict/)
* [bach.DataFrame.head](https://objectiv.io/docs/modeling/bach/api-reference/DataFrame/head/)

## Unsample and get the SQL
The sampled dataset we used above can easily be unsampled.

In [None]:
features_set_full = features_set_sample.get_unsampled()

The SQL for any analysis can be exported with one command, so you can use models in production directly to simplify data debugging & delivery to tools like Metabase and dbt. See how you can [quickly create BI dashboards with this](https://objectiv.io/docs/home/up#creating-bi-dashboards).

In [None]:
# show the underlying SQL for this dataframe - works for any dataframe/model in Objectiv
display_sql_as_markdown(features_set_full)

That’s it! Stay tuned for more metrics to assess model fit, and for a simpler way to split the data into training and testing datasets.

[Join us on Slack](https://objectiv.io/join-slack) if you have any questions or suggestions.

# Next Steps


## Use this notebook with your own data
You can use the example notebooks on any dataset that was collected with Objectiv's tracker, so feel free to 
use them to bootstrap your own projects. They are available as Jupyter notebooks on our [GitHub repository](https://github.com/objectiv/objectiv-analytics/tree/main/notebooks). See [instructions to set up the Objectiv tracker](https://objectiv.io/docs/tracking/).


## Check out related example notebooks
- [User Intent analysis](./basic-user-intent.ipynb) - run basic User Intent analysis with Objectiv.