# Prediction metrics for Flow

This notebook contains a baseline model, with metrics, for data consisting only of flow information.
The flow is affected by rain, but we don't have rain data here.
 
This dataset has 36 months of data. 

Our model will predict the flow for the next 24 hours. We will use all data up to the predicted day to train the model, and then validate the prediction on that day, which was not seen during training.
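
The expanding-window scheme described above can be sketched on synthetic data. This is only a minimal illustration under stated assumptions: the toy frame `toy` and the generator `expanding_window_days` are hypothetical names introduced here, and the 5-minute sampling interval and the flow values are made up.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute series covering three full days (values are made up).
times = pd.date_range('2017-01-01', periods=3 * 288, freq='5min')
toy = pd.DataFrame({'time': times, 'flow': np.arange(len(times), dtype=float)})


def expanding_window_days(df, first_day):
    """Yield (day, train, test): train is all data before `day`, test is `day` itself."""
    day = pd.Timestamp(first_day)
    while True:
        next_day = day + pd.Timedelta(1, 'D')
        train = df[df.time < day]
        test = df[(df.time >= day) & (df.time < next_day)]
        if len(test) == 0:
            break  # no more data to validate on
        yield day, train, test
        day = next_day


folds = list(expanding_window_days(toy, '2017-01-02'))
for day, train, test in folds:
    # The training set grows by one day per fold; the test set is always one day.
    print(day.date(), len(train), len(test))
```

Each fold trains only on timestamps strictly earlier than the test day, so no test observation leaks into training.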

In [1]:
import datetime
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['figure.figsize'] = 12, 4

## Load dataset

In [2]:
dataset = pd.read_csv('../datasets/flow.csv.gz', compression='gzip', parse_dates=['time'])
print(dataset.info())
dataset.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438446 entries, 0 to 438445
Data columns (total 2 columns):
time    438446 non-null datetime64[ns]
flow    438446 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 6.7 MB
None


| | time | flow |
|---|---|---|
| 438441 | 2017-11-10 14:20:00 | 90.297165 |
| 438442 | 2017-11-10 14:25:00 | 90.17769 |
| 438443 | 2017-11-10 14:30:00 | 90.52699 |
| 438444 | 2017-11-10 14:35:00 | 91.27784 |
| 438445 | 2017-11-10 14:40:00 | 92.04197 |


## Helper functions

In [3]:
def split(df, day):
    """
    Split the dataset into a training set and a test set on a given day.
    All data before the given day goes into the training set,
    while data from the given day itself forms the test set.
    """
    next_day = day + pd.Timedelta(1, 'D')
    train = df[df.time < day]
    test = df[(df.time >= day) & (df.time < next_day)]
    X_train = train[['time']]
    Y_train = train['flow']
    X_test = test[['time']]
    Y_test = test['flow']
    return X_train, Y_train, X_test, Y_test


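def rmse(y_hat, y):
    """
    Calculate the Root Mean Square Error between predictions y_hat and true values y
    """
    # scikit-learn expects (y_true, y_pred); the result is symmetric either way
    return np.sqrt(mean_squared_error(y, y_hat))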


X_train, Y_train, X_test, Y_test = split(dataset, pd.Timestamp('2015-04-25'))
print(rmse(Y_test, Y_test))
X_train.tail()

0.0


| | time |
|---|---|
| 170463 | 2015-04-24 23:35:00 |
| 170464 | 2015-04-24 23:40:00 |
| 170465 | 2015-04-24 23:45:00 |
| 170466 | 2015-04-24 23:50:00 |
| 170467 | 2015-04-24 23:55:00 |


## Build model

Here we will build two naive models.

The first one always predicts 0. The second one predicts the mean of the training labels.

We will evaluate which model is better.

In [4]:
class ZeroModel:
    """
    This model always predicts 0
    """
    
    def fit(self, X, y):
        pass
        
    def predict(self, X):
        return np.zeros(len(X))
    
    
class MeanModel:
    """
    This naive model predicts the mean of the training labels (this is a regression task)
    """
    
    def __init__(self):
        self.mu = 0
    
    def fit(self, X, y):
        """Calculate mean"""
        self.mu = np.mean(y)
        
    def predict(self, X):
        """
        Predict the mean value.
        Returns:
        numpy array with one entry per row of X, all equal to the training mean.
        """
        return np.ones(len(X)) * self.mu    

## Evaluate model

Here we will evaluate each model on every day of the last 12 months, and then report the 90th percentile of the daily RMSE as the model's score (lower is better).

First, let's define a function for evaluating a model.

In [5]:
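def evaluate_model(model):
    """
    Refit the model for each day starting on 2016-11-11, score it on that day
    with RMSE, and return the 90th percentile of the daily scores
    """
    split_day = pd.Timestamp('2016-11-11')
    costs = []

    while True:
        X_train, Y_train, X_test, Y_test = split(dataset, split_day)
        # Stop at the first day that has no data
        if len(X_test) == 0:
            break
        model.fit(X_train, Y_train)
        cost = rmse(model.predict(X_test), Y_test)
        costs.append(cost)
        split_day += pd.Timedelta(1, 'D')
    return np.percentile(costs, 90)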
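
Summarising the daily errors by their 90th percentile, as the evaluation does, can be illustrated on a handful of numbers. The values in `daily_rmse` are hypothetical and chosen only to include one unusually bad day; they are not results from this dataset.

```python
import numpy as np

# Hypothetical per-day RMSE values (made up for illustration), one bad day included.
daily_rmse = np.array([10.0, 12.0, 11.0, 9.0, 50.0, 13.0, 10.0, 12.0, 11.0, 14.0])

# np.percentile interpolates linearly between neighbouring sorted values by default.
p90 = np.percentile(daily_rmse, 90)

print('mean:            {:.2f}'.format(daily_rmse.mean()))
print('median:          {:.2f}'.format(np.median(daily_rmse)))
print('90th percentile: {:.2f}'.format(p90))
print('share of days at or below it: {:.0%}'.format((daily_rmse <= p90).mean()))
```

Unlike the median, the 90th percentile is pulled upward by the harder days, so the score reflects how the model behaves on most days, including difficult ones, without being set by the single worst day.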

### Now let's compare our models

In [6]:
zero_model = ZeroModel()
score = evaluate_model(zero_model)
print('ZeroModel score: {:.2f}'.format(score))

mean_model = MeanModel()
score = evaluate_model(mean_model)
print('MeanModel score: {:.2f}'.format(score))

ZeroModel score: 84.52
MeanModel score: 20.69


It looks like MeanModel is better (the lower the error, the better).
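
One way to see why a constant mean prediction beats a constant zero: for any constant `c`, the squared RMSE equals the variance of the targets plus the squared distance between `c` and their mean. The sketch below checks this numerically on synthetic data; the array `y`, its distribution parameters, and the helper `rmse_constant` are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "one day" of flow: 288 five-minute readings (parameters are made up).
y = rng.normal(loc=80.0, scale=15.0, size=288)


def rmse_constant(y, c):
    """RMSE of predicting the constant c for every element of y."""
    return np.sqrt(np.mean((y - c) ** 2))


rmse_zero = rmse_constant(y, 0.0)
rmse_mean = rmse_constant(y, y.mean())

# Decomposition: RMSE(c)^2 = var(y) + (mean(y) - c)^2
decomposed_zero = np.sqrt(y.var() + (y.mean() - 0.0) ** 2)

print('constant 0:    {:.2f} (decomposition: {:.2f})'.format(rmse_zero, decomposed_zero))
print('constant mean: {:.2f} (std of y:      {:.2f})'.format(rmse_mean, y.std()))
```

In the notebook the mean comes from the training data rather than from the test day itself, so MeanModel's daily error also includes the gap between the training mean and that day's mean. ZeroModel pays the full distance from zero to the day's mean every day.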