# Training a simple logistic regression model with dbt and fal

This notebook shows how you can use fal and dbt to build a machine learning model and deploy it in your dbt pipeline.

We start by installing the dependencies.

In [None]:
!pip install -r ../requirements.txt

Next, we use `dbt seed` to load raw data onto the data warehouse.

In [None]:
!dbt seed --profiles-dir ..

Now we can run our dbt models.

In [None]:
!dbt run -s customer_orders customer_orders_labeled --profiles-dir ..

In this next cell, we import all the necessary modules.

In [None]:
from fal import FalDbt

Initialize FalDbt:

In [None]:
faldbt = FalDbt(project_dir="..", profiles_dir="..")

## Part 1: Training a new machine learning model

We download the `customer_orders_labeled` model as a pandas DataFrame and print its first rows.

In [None]:
orders_df = faldbt.ref("customer_orders_labeled")
orders_df.head()

In [None]:
print('Summary statistics:\n', orders_df.describe())

Let's plot a sample from this DataFrame to see what our data actually looks like. Red dots represent orders that were returned, and blue dots represent orders that were not.

In [None]:
import matplotlib.pyplot as plt
# Sample data for plot
plot_data = orders_df.sample(frac=0.1, random_state=123)

colors = ['red' if r else 'blue' for r in plot_data['return']]  # assign colors based on whether or not order was returned

plt.scatter(plot_data['age'], plot_data['total_price'], c=colors)
plt.xlabel('Age')
plt.ylabel('Total Price')
plt.show()

It's now time to train a simple logistic regression model. We use the `LogisticRegression` class from scikit-learn.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Train logistic regression model
X = orders_df[['age', 'total_price']]
y = orders_df['return']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

lr_model = LogisticRegression(random_state=123)
lr_model.fit(X_train, y_train)

# Test model
y_pred = lr_model.predict(X_test)
print(classification_report(y_test, y_pred))


An accuracy of 86% is good enough for this example.
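
For intuition, the decision rule behind a fitted logistic regression model, and the accuracy figure computed from its predictions, can be sketched in plain Python. The coefficients below are hypothetical placeholders, not the values learned above; after fitting, the real ones are available as `lr_model.coef_` and `lr_model.intercept_`.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
W_AGE = -0.05
W_TOTAL_PRICE = 0.01
INTERCEPT = -1.0


def return_probability(age: float, total_price: float) -> float:
    """Apply the logistic function to a linear combination of the features."""
    z = INTERCEPT + W_AGE * age + W_TOTAL_PRICE * total_price
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(age: float, total_price: float) -> int:
    """Predict 1 (returned) when the probability exceeds 0.5, else 0."""
    return int(return_probability(age, total_price) > 0.5)


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

Accuracy is simply the share of test rows where the predicted label equals the true label, so it is worth reading alongside the per-class precision and recall in the classification report.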

## Part 2: Making batch predictions with stored models
We start this section by downloading the `customer_orders` model as a DataFrame and printing its head.

In [None]:
orders_new_df = faldbt.ref("customer_orders")

orders_new_df.head()

As we can see, `customer_orders` doesn't have the `return` column. That's what we're trying to predict. We've put our model training code in the `order_return_prediction_models` Python model. In this next cell, we download that model's output and pick the most accurate model.

In [None]:
models_df = faldbt.ref("order_return_prediction_models")
best_model_df = models_df[models_df.accuracy == models_df.accuracy.max()]
model_name = best_model_df.model_name.iloc[0]  # select by position; the filtered frame keeps its original index labels
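
Filtering with a boolean mask keeps the original index labels, so the best row is not guaranteed to carry label `0`. The sketch below illustrates this on a small, hypothetical stand-in for the models table (the model names and accuracy values are made up), and shows `idxmax` as an alternative lookup.

```python
import pandas as pd

# Hypothetical stand-in for the order_return_prediction_models table.
models_df = pd.DataFrame(
    {
        "model_name": ["model_a", "model_b", "model_c"],
        "accuracy": [0.81, 0.86, 0.84],
    }
)

# The best row here keeps its original index label, 1.
best_model_df = models_df[models_df.accuracy == models_df.accuracy.max()]

# .iloc[0] selects by position, so it works whatever the label is.
model_name = best_model_df.model_name.iloc[0]

# Alternative: idxmax() returns the index label of the highest accuracy.
same_name = models_df.loc[models_df.accuracy.idxmax(), "model_name"]
```

If several models tie for the highest accuracy, both approaches pick the first one in table order.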

In our example, the ML models are stored in the `ml_models` directory. In production, you'll want to use a cloud storage provider, such as S3 or GCS. Here, we load the target ML model by simply opening the file.

In [None]:
import pickle

with open(f"../ml_models/{model_name}.pkl", "rb") as f:
    loaded_model = pickle.load(f)
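
For reference, a file like this can be produced with `pickle.dump`. The following is a minimal round-trip sketch in a temporary directory; the `model_params` dict and the `example_model.pkl` file name are hypothetical stand-ins for a fitted estimator and its file, not the ones used in this project.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted model object.
model_params = {"model_name": "example_model", "coef": [-0.05, 0.01], "intercept": -1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "example_model.pkl"

    # Save: serialize the object to disk in binary mode.
    with open(model_path, "wb") as f:
        pickle.dump(model_params, f)

    # Load: the same pattern as the cell above.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)
```

The restored object is equal to the one that was saved, which is what lets a model trained in one step of the pipeline be reused in another.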

Finally, it's time to make some predictions:

In [None]:
predictions = loaded_model.predict(orders_new_df[["age", "total_price"]])
orders_new_df["predicted_return"] = predictions

In [None]:
orders_new_df.sample(frac=0.5, random_state=123)

Let's plot our predictions to see if they make sense:

In [None]:
# Sample data for plot
plot_data = orders_new_df.sample(frac=0.5, random_state=123)

colors = ['red' if r else 'blue' for r in plot_data['predicted_return']]
plt.scatter(plot_data['age'], plot_data['total_price'], c=colors)
plt.xlabel('Age')
plt.ylabel('Total Price')
plt.show()

That seems about right!

In [None]:
new_df = faldbt.ref("order_return_predictions")
new_df.head()

## Part 3: Making single row predictions
We can also write a function that accepts features and returns a label:

In [None]:
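def predict_return(age: float, total_price: float) -> bool:
    """Predict whether a single order will be returned."""
    import pandas as pd

    # Wrap the features in a one-row DataFrame with the same column names used above
    df = pd.DataFrame({"age": [age], "total_price": [total_price]})
    predictions = loaded_model.predict(df[["age", "total_price"]])
    # Cast to a plain Python bool to match the return annotation
    return bool(predictions[0] == 1)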

Now, making single predictions is easy:

In [None]:
predict_return(18.0, 400)

For more information about this example, see our [blog post](https://blog.fal.ai/build-and-deploy-machine-learning-models-from-jupyter-notebooks-with-fal-and-dbt/), which gives a detailed walkthrough of this notebook and of how to incorporate its contents into dbt Python models.