# Predict Bike Trips

In this example, we build a machine learning application to predict the number of trips in the next biking period. This application is structured into three important steps:

* Prediction Engineering
* Feature Engineering
* Machine Learning

In the first step, we generate new labels from the data by using [Compose](https://compose.alteryx.com/). In the second step, we generate features for the labels by using [Featuretools](https://docs.featuretools.com/). In the third step, we search for the best machine learning pipeline by using [EvalML](https://evalml.alteryx.com/).
By working through these steps, you will learn how to build machine learning applications for real-world problems like forecasting demand. Let's get started.

In [None]:
%matplotlib inline
from demo.chicago_bike import load_sample
from matplotlib.pyplot import subplots
import composeml as cp
import featuretools as ft
import evalml

We will use data provided by Divvy, a bicycle-sharing system in Chicago. The dataset contains one record for each bike trip.

In [None]:
df = load_sample()

df.head()

## Prediction Engineering

> How many trips will occur from a station in the next biking period?

We can change the length of the biking period to create different prediction problems. For example, how many bike trips will occur in the next 4 hours, or in the next week? We can create these variations by changing a single parameter. This helps us explore different scenarios, which is crucial for making better decisions.
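
To make the effect of the period length concrete, here is a minimal sketch that uses only the standard library. It is not how Compose works internally. The trip times and the helper `count_trips_per_period` are hypothetical and exist only for illustration. The sketch counts trips in consecutive periods of equal length and skips periods that contain no trips.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical trip start times (not taken from the Divvy data).
start_times = [
    datetime(2014, 6, 30, 8, 15),
    datetime(2014, 6, 30, 9, 40),
    datetime(2014, 6, 30, 13, 5),
    datetime(2014, 6, 30, 21, 30),
    datetime(2014, 7, 1, 7, 50),
]


def count_trips_per_period(times, first_period_start, period):
    """Count trips in consecutive periods of equal length."""
    counts = Counter()
    for t in times:
        if t < first_period_start:
            continue  # trips before the first period are not counted
        k = (t - first_period_start) // period  # index of the period containing t
        counts[first_period_start + k * period] += 1
    return dict(sorted(counts.items()))


start = datetime(2014, 6, 30, 8, 0)
four_hour_counts = count_trips_per_period(start_times, start, timedelta(hours=4))
weekly_counts = count_trips_per_period(start_times, start, timedelta(weeks=1))
```

The same trips produce four labels with a 4-hour period and a single label with a 1-week period, so the period length alone defines a different prediction problem.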

### Defining the Labeling Process

Let's start by defining a labeling function to calculate the number of trips. Given that each observation is an individual trip, the number of trips is just the number of observations.

In [None]:
def trip_count(ds):
    return len(ds)

### Representing the Prediction Problem

Then, let's represent the prediction problem by creating a label maker with the following parameters:

* The `target_entity` as the column for the starting station ID, since we want to process trips from each starting station.
* The `labeling_function` as the function to calculate the number of trips.
* The `time_index` as the column for the start time of the trip. The biking periods are based on this time index.
* The `window_size` as the length of a biking period. We can easily change this parameter to create variations of the prediction problem.

In [None]:
lm = cp.LabelMaker(
    target_entity='from_station_id',
    labeling_function=trip_count,
    time_index='starttime',
    window_size='13h',
)

### Finding the Training Examples

Now, let's run a search to get the training examples by using the following parameters:

* The trips sorted by the start time.
* `num_examples_per_instance` as the number of training examples to find per station. In this case, we use `-1` to search for all existing examples.
* `minimum_data` as the start time of the first biking period. This is also the first cutoff time for building features.

In [None]:
lt = lm.search(
    df.sort_values('starttime'),
    num_examples_per_instance=-1,
    minimum_data='2014-06-30 08:00',
    verbose=False,
)

lt.head()
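
The role of the cutoff time can be illustrated with a small sketch that uses only the standard library. It assumes that a label describes the period starting at its cutoff time and that features may use only trips from before that time. The trips, the helper `make_example`, and the two feature names are hypothetical and do not come from Compose or Featuretools.

```python
from datetime import datetime, timedelta

# Hypothetical trips for one station: (start time, duration in minutes).
trips = [
    (datetime(2014, 6, 30, 6, 30), 12),
    (datetime(2014, 6, 30, 7, 45), 20),
    (datetime(2014, 6, 30, 8, 0), 9),
    (datetime(2014, 6, 30, 10, 10), 15),
    (datetime(2014, 6, 30, 22, 0), 30),
]


def make_example(trips, cutoff, window):
    """Split trips around a cutoff time into feature inputs and a label."""
    # Only trips that started strictly before the cutoff may feed the features.
    history = [trip for trip in trips if trip[0] < cutoff]
    # The label counts the trips that start in [cutoff, cutoff + window).
    label = sum(1 for trip in trips if cutoff <= trip[0] < cutoff + window)
    durations = [duration for _, duration in history]
    features = {
        'past_trip_count': len(history),
        'past_mean_duration': sum(durations) / len(durations) if durations else None,
    }
    return features, label


cutoff = datetime(2014, 6, 30, 8, 0)
features, label = make_example(trips, cutoff, timedelta(hours=13))
```

Keeping the two sides of the cutoff separate means that the features never see the trips that the label counts.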

In [None]:
lt.describe()

In [None]:
fig, ax = subplots(nrows=2, ncols=1, figsize=(6, 8))
lt.plot.distribution(ax=ax[0])
lt.plot.count_by_time(ax=ax[1])
fig.tight_layout(pad=2)

## Feature Engineering

In [None]:
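# Create an empty entity set to hold the trips and the entities derived from them.
es = ft.EntitySet('chicago_bike')

# Add the trips as the base entity, with the trip ID as the index
# and the start time as the time index.
es.entity_from_dataframe(
    dataframe=df.reset_index(),
    entity_id='trips',
    time_index='starttime',
    index='trip_id',
)

# Split the starting stations into their own entity.
# This is the target entity for the features.
es.normalize_entity(
    base_entity_id='trips',
    new_entity_id='from_station_id',
    index='from_station_id',
    make_time_index=False,
)

# Split the weather events into their own entity.
es.normalize_entity(
    base_entity_id='trips',
    new_entity_id='weather',
    index='events',
    make_time_index=False,
)

# Split the genders into their own entity.
es.normalize_entity(
    base_entity_id='trips',
    new_entity_id='gender',
    index='gender',
    make_time_index=False,
)

# Mark the values that conditional features can filter on.
es["trips"]["gender"].interesting_values = ['Male', 'Female']
es["trips"]["events"].interesting_values = ['tstorms']

# Plot the structure of the entity set.
es.plot()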

In [None]:
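# Build one row of features per label. The label times serve as cutoff times,
# and `include_cutoff_time=False` excludes data at the cutoff time itself.
fm, fd = ft.dfs(
    entityset=es,
    target_entity='from_station_id',
    trans_primitives=['hour', 'week', 'is_weekend'],
    cutoff_time=lt,
    cutoff_time_in_index=True,
    include_cutoff_time=False,
    verbose=False,
)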

fm.head()

## Machine Learning

In [None]:
# Separate the labels from the features, then hold out 10% of the examples for testing.
y = fm.pop('trip_count')
splits = evalml.preprocessing.split_data(fm, y, test_size=0.1, random_state=0, regression=True)
X_train, X_holdout, y_train, y_holdout = splits

In [None]:
# Search for the regression pipeline that scores best on R2.
automl = evalml.AutoMLSearch(problem_type='regression', objective='r2', random_state=0)
automl.search(X_train, y_train, data_checks='disabled', show_iteration_plot=False)

In [None]:
automl.best_pipeline.describe()
automl.best_pipeline.graph()

In [None]:
# Fit the best pipeline on the training set, then score it on the holdout set.
best_pipeline = automl.best_pipeline.fit(X_train, y_train)
score = best_pipeline.score(X_holdout, y_holdout, objectives=['r2'])
dict(score)

In [None]:
# Plot the 20 features with the largest absolute importance.
feature_importance = best_pipeline.feature_importance
feature_importance = feature_importance.set_index('feature')['importance']
top_k = feature_importance.abs().sort_values().tail(20).index
feature_importance[top_k].plot.barh(figsize=(8, 8), fontsize=14, width=.7);