## Scikit-Learn-style API

This example demonstrates compatibility with scikit-learn's basic `fit` API.
For demonstration, we'll use the perennial NYC taxi cab dataset.

In [1]:
import os
import s3fs
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from distributed import Client

from dask import persist
# dask_glm.LogisticRegression was moved to dask_ml
from dask_ml.linear_model.glm import LogisticRegression

In [2]:
if not os.path.exists('trip.csv'):
    s3 = s3fs.S3FileSystem(anon=True)
    s3.get("dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", "trip.csv")

In [3]:
client = Client()

In [4]:
ddf = dd.read_csv("trip.csv")

We can use the `dask.dataframe` API to explore the dataset, and notice that some of the values look suspicious:

In [5]:
ddf[['trip_distance', 'fare_amount']].describe().compute()

| | trip_distance | fare_amount |
|---|---|---|
| count | 12748990.0 | 12748990.0 |
| mean | 13.45913 | 11.90566 |
| std | 9844.094 | 10.30254 |
| min | 0.0 | -450.0 |
| 25% | 1.0 | 6.5 |
| 50% | 1.7 | 9.0 |
| 75% | 3.1 | 13.5 |
| max | 15420000.0 | 4008.0 |


Scikit-learn doesn't currently support filtering observations inside a pipeline ([yet](https://github.com/scikit-learn/scikit-learn/issues/3855)), so we'll do this before anything else.

In [6]:
# these filter out less than 1% of the observations
ddf = ddf[(ddf.trip_distance < 20) &
          (ddf.fare_amount < 150)]
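
One way to check how much data a filter like this removes is to take the mean of the boolean mask. The sketch below does so on a tiny, made-up pandas DataFrame (the `toy` values are illustrative and not drawn from the taxi data); on a dask DataFrame the same expression would additionally need a trailing `.compute()`.

```python
import pandas as pd

# Hypothetical toy data; not taken from the NYC taxi dataset.
toy = pd.DataFrame({
    "trip_distance": [1.0, 1.7, 3.1, 25.0, 0.5],
    "fare_amount": [6.5, 9.0, 13.5, 60.0, 200.0],
})

# Same predicate as the filter above.
keep = (toy.trip_distance < 20) & (toy.fare_amount < 150)

# The mean of a boolean Series is the fraction of True values,
# so one minus that mean is the fraction of rows removed.
fraction_removed = 1 - keep.mean()
filtered = toy[keep]

print(f"removed {fraction_removed:.0%} of rows, kept {len(filtered)}")
```

In this toy frame two of the five rows fail the predicate, which is far more than the real dataset loses; the point is only the mechanics of the check.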

Now, we'll split our DataFrame into a train and test set, and select our feature matrix and target column (whether the passenger tipped). To ensure this example runs quickly for the documentation, we'll make the training set much smaller than usual: 5% of the rows, with the remaining 95% held out for testing.

In [7]:
df_train, df_test = ddf.random_split([0.05, 0.95], random_state=2)

columns = ['VendorID', 'passenger_count', 'trip_distance', 'payment_type', 'fare_amount']

X_train, y_train = df_train[columns], df_train['tip_amount'] > 0
X_test, y_test = df_test[columns], df_test['tip_amount'] > 0

X_train = X_train.repartition(npartitions=2)
y_train = y_train.repartition(npartitions=2)

X_train, y_train, X_test, y_test = persist(
    X_train, y_train, X_test, y_test
)
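
For readers who want to try the same steps without a cluster, here is a minimal in-memory analogue using pandas. It is a sketch under assumptions: the `toy` frame is synthetic, and `DataFrame.sample` followed by `drop` stands in for dask's `random_split`; it is not how dask implements the split.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the taxi data (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "fare_amount": rng.uniform(3, 50, size=200),
    "tip_amount": rng.choice([0.0, 1.5, 3.0], size=200),
})

# 5% of rows for training, the remainder for testing.
train = toy.sample(frac=0.05, random_state=2)
test = toy.drop(train.index)

# The target is a boolean: did the passenger tip at all?
y_train = train["tip_amount"] > 0
y_test = test["tip_amount"] > 0

print(len(train), len(test), y_train.dtype)
```

Comparing `tip_amount` against zero turns a continuous column into a binary label, which is what makes this a classification problem suitable for logistic regression.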

With our training data in hand, we fit our logistic regression.
Nothing here should be surprising to those familiar with `scikit-learn`.

In [8]:
%%time
# this is the *dask-ml* LogisticRegression (built on dask-glm), not scikit-learn's
lm = LogisticRegression(fit_intercept=False)
lm.fit(X_train.values, y_train.values)

CPU times: user 4.99 s, sys: 1.48 s, total: 6.47 s
Wall time: 57.7 s
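
To make concrete what a logistic regression with `fit_intercept=False` estimates, the following NumPy sketch fits one by plain gradient descent on synthetic data. It is illustrative only: the helper `fit_logistic_no_intercept`, the learning rate, and the iteration count are all hypothetical choices, the model is unregularized, and this is not the solver dask-ml uses.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_no_intercept(X, y, lr=0.1, n_iter=500):
    """Unregularized logistic regression without an intercept term,
    fitted by batch gradient descent on the mean log-loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)               # predicted probabilities
        grad = X.T @ (p - y) / len(y)    # gradient of the mean log-loss
        w -= lr * grad
    return w

# Synthetic data generated from known coefficients (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_w = np.array([2.0, -1.0])
y = (rng.uniform(size=500) < sigmoid(X @ true_w)).astype(float)

w = fit_logistic_no_intercept(X, y)
print(w)
```

Without an intercept, the decision boundary is forced through the origin of feature space, so the model has one fewer parameter to learn.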


Again, following the lead of scikit-learn, we can measure the performance of the estimator on the training dataset:

In [9]:
lm.score(X_train.values, y_train.values).compute()

0.88040294022117882

and on the test dataset:

In [10]:
lm.score(X_test.values, y_test.values).compute()

0.88089563102388546
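
As in scikit-learn, `score` for a classifier reports mean accuracy: the fraction of predictions that match the true labels. The NumPy sketch below shows that calculation on made-up labels (the arrays are hypothetical), alongside the accuracy of always predicting the majority class, which is a useful baseline when reading a score like the ones above.

```python
import numpy as np

# Hypothetical true labels and predictions.
y_true = np.array([True, False, True, True, False, True, True, False])
y_pred = np.array([True, False, False, True, False, True, True, True])

# Mean accuracy: fraction of positions where prediction equals truth.
accuracy = float(np.mean(y_true == y_pred))

# Baseline: accuracy of always predicting the more common class.
positive_rate = float(np.mean(y_true))
baseline = max(positive_rate, 1 - positive_rate)

print(accuracy, baseline)
```

An accuracy figure is only informative relative to such a baseline, since a heavily imbalanced target can yield a high score from a trivial model.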