<a href="https://colab.research.google.com/github/ekaratnida/Data_Streaming_and_Realtime_Analytics/blob/main/Week12/Example-creditcard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install river

In [None]:
from river import datasets

dataset = datasets.CreditCard()
print(dataset)

Credit card frauds.

The datasets contains transactions made by credit cards in September 2013 by european
cardholders. This dataset presents transactions that occurred in two days, where we have 492
frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class
(frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, we cannot provide the original features and more
background information about the data. Features V1, V2, ... V28 are the principal components
obtained with PCA, the only features which have not been transformed with PCA are 'Time' and
'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first
transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be
used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and
it tak

The dataset is a streaming dataset, so it is not loaded into memory all at once. Instead, we iterate over the samples one at a time with a `for` loop:

In [None]:
# Iterate over the whole stream; x and y end up holding the last sample.
for x, y in dataset:
    pass

print(x)

Downloading https://maxhalford.github.io/files/datasets/creditcardfraud.zip (65.95 MB)
Uncompressing into /root/river_data/CreditCard
{'Time': 172792.0, 'V1': -0.53341252200504, 'V2': -0.189733337002305, 'V3': 0.703337366963779, 'V4': -0.506271240328258, 'V5': -0.0125456787599659, 'V6': -0.649616685713792, 'V7': 1.57700625437629, 'V8': -0.414650407552662, 'V9': 0.486179505267237, 'V10': -0.915426648905893, 'V11': -1.04045833522361, 'V12': -0.0315130540252157, 'V13': -0.188092900791737, 'V14': -0.0843164698151014, 'V15': 0.0413334553360658, 'V16': -0.302620086427415, 'V17': -0.660376645182784, 'V18': 0.16742993371973, 'V19': -0.256116871098099, 'V20': 0.382948104875066, 'V21': 0.261057330790975, 'V22': 0.643078437820093, 'V23': 0.376777014169917, 'V24': 0.00879737940024202, 'V25': -0.473648703898825, 'V26': -0.818267121041176, 'V27': -0.00241530880001015, 'V28': 0.0136489143320671, 'Amount': 217.0}


In [None]:
print(y)

0
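To make the streaming idea concrete, here is a minimal sketch of a lazy reader written in plain Python. It is not how `river` loads the data: `stream_rows` is a hypothetical helper, and the three CSV rows are made-up values standing in for the real file. The point is only that a generator hands over one `(x, y)` pair at a time, so the full dataset never has to be held in memory.

```python
import csv
import io


def stream_rows(text_file, target="Class"):
    """Yield (features, label) pairs one row at a time from a CSV file object.

    Hypothetical helper for illustration; not part of river.
    """
    for row in csv.DictReader(text_file):
        y = int(row.pop(target))
        x = {name: float(value) for name, value in row.items()}
        yield x, y


# Tiny in-memory stand-in for the real file (values are made up).
fake_csv = io.StringIO("Time,Amount,Class\n0,149.62,0\n1,2.69,0\n2,378.66,1\n")

n_samples = 0
n_frauds = 0
for x, y in stream_rows(fake_csv):
    # Only the current row is materialised at this point.
    n_samples += 1
    n_frauds += y

print(n_samples, n_frauds)
```

Because the loop keeps only running counts, its memory use stays constant regardless of how many rows the file contains.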


Typically, models learn via a `learn_one(x, y)` method, which takes as input some features and a target value. Being able to learn with a single instance gives a lot of flexibility. For instance, a model can be updated whenever a new sample arrives from a stream. To exemplify this, let's train a logistic regression on the above dataset.

In [None]:
from river import linear_model

model = linear_model.LogisticRegression()

for x, y in dataset:
    model.learn_one(x, y)
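As a rough illustration of what a single `learn_one` call might do, the sketch below implements an online logistic regression with plain stochastic gradient descent. `TinyLogisticRegression`, its learning rate, and the toy stream are all assumptions made for this example; the actual `river` implementation uses its own optimizers and differs in detail.

```python
import math


class TinyLogisticRegression:
    """Hypothetical minimal online logistic regression (plain SGD)."""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.weights = {}  # one weight per feature name, created lazily
        self.bias = 0.0

    def predict_proba_one(self, x):
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
        p = 1.0 / (1.0 + math.exp(-z))
        return {False: 1.0 - p, True: p}

    def learn_one(self, x, y):
        # The gradient of the log loss w.r.t. the linear output is (p - y).
        error = self.predict_proba_one(x)[True] - float(y)
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error


# Toy stream (made up): the label is 1 when feature "a" is positive.
toy_stream = [({"a": 1.0}, 1), ({"a": -1.0}, 0)] * 200

sketch = TinyLogisticRegression(lr=0.1)
for x, y in toy_stream:
    sketch.learn_one(x, y)

print(sketch.predict_proba_one({"a": 1.0}))
```

Each update touches only the features present in the current sample, which is why a model of this kind can be refreshed as soon as a new observation arrives.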

Predictions can be obtained by calling a model's `predict_one` method. In the case of a classifier, we can also use `predict_proba_one` to produce probability estimates.

In [None]:
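model = linear_model.LogisticRegression()

for x, y in dataset:
    # Predict first, then learn, so the prediction never uses the sample's own label.
    y_pred = model.predict_proba_one(x)
    model.learn_one(x, y)

# Probability estimates for the last sample in the stream.
print(y_pred)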

{False: 1.0, True: 0}


The `metrics` module gives access to many metrics that are commonly used in machine learning. Like the rest of `river`, these metrics can be updated with one element at a time:

In [None]:
from river import metrics

model = linear_model.LogisticRegression()

metric = metrics.ROCAUC()

for x, y in dataset:
    y_pred = model.predict_proba_one(x)
    model.learn_one(x, y)
    metric.update(y, y_pred)
    
metric

ROCAUC: 0.528647
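The idea of a metric that is updated one element at a time can be sketched with a much simpler stand-in than ROC AUC. The `RunningAccuracy` class below is hypothetical and is not part of `river`; it only shows that an incremental metric needs to keep a few counters rather than the full history of predictions. The four label pairs are made up.

```python
class RunningAccuracy:
    """Hypothetical incremental metric: fraction of correct predictions so far."""

    def __init__(self):
        self.n_correct = 0
        self.n_total = 0

    def update(self, y_true, y_pred):
        # Constant-time update: no past predictions are stored.
        self.n_correct += int(y_true == y_pred)
        self.n_total += 1

    def get(self):
        return self.n_correct / self.n_total if self.n_total else 0.0


running = RunningAccuracy()
for y_true, y_pred in [(0, 0), (1, 0), (1, 1), (0, 0)]:
    running.update(y_true, y_pred)

print(running.get())  # 3 correct out of 4 -> 0.75
```

Accuracy is used here only because it is easy to compute incrementally; on a dataset as unbalanced as this one, a model that always predicts the majority class would still score very high on it.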

A common way to improve the performance of a logistic regression is to scale the data. This can be done by using a `preprocessing.StandardScaler`. In particular, we can define a pipeline to organise our model into a sequence of steps:

In [None]:
from river import compose
from river import preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

model

In [None]:
metric = metrics.ROCAUC()

for x, y in dataset:
    y_pred = model.predict_proba_one(x)
    model.learn_one(x, y)
    metric.update(y, y_pred)
    
metric

ROCAUC: 0.891088

As we can see, the model performs much better once the data is scaled. Under the hood, the standard scaler maintains a running mean and a running variance for each feature in the dataset. Each feature is thus scaled according to the mean and variance observed up to that point in the stream.
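One common way to maintain a running mean and variance is Welford's algorithm, sketched below for a single feature. This is an illustration under stated assumptions, not `river`'s actual code: `RunningScaler` and `transform_one` as written here are hypothetical, and the use of the population variance (dividing by `n`) is an assumption.

```python
import math


class RunningScaler:
    """Hypothetical single-feature scaler using Welford's running mean/variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def learn_one(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        # Uses the old and the new mean, which keeps the update numerically stable.
        self.m2 += delta * (value - self.mean)

    def transform_one(self, value):
        variance = self.m2 / self.n if self.n else 0.0  # population variance (assumption)
        if variance == 0.0:
            return 0.0
        return (value - self.mean) / math.sqrt(variance)


scaler = RunningScaler()
for amount in [10.0, 20.0, 30.0]:
    scaler.learn_one(amount)

print(scaler.mean, scaler.transform_one(30.0))
```

Since the statistics only reflect the samples seen so far, early observations are scaled with rough estimates that become more reliable as the stream progresses.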

This concludes this short guide to getting started with `river`. There is a lot more to discover and understand. Head to the [user guide](../user-guide/reading-data.md) for recipes on how to perform common machine learning tasks. You may also consult the [API reference](../api/overview.md), which is a catalogue of all the modules that `river` exposes. Finally, the [examples](../examples/batch-to-online.md) section contains comprehensive examples for various use cases.