# Homework 3 

You will have 2 datasets to work on. 
#### You have to answer questions in this [form](https://goo.gl/forms/5gfxvKZxpoydoeOB2) and provide your code

### 1. Wikipedia Web Traffic Time Series

Data from the [Kaggle competition](https://www.kaggle.com/c/web-traffic-time-series-forecasting).

*wikipedia_train3* is the train data; *wikipedia_test3* is the test data, which we created from the original train data. For more information about the dataset, see the Homework 1 assignment.


## Wikipedia page views (SMAPE metric)

In [29]:
import os
import re
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression, Lasso, Ridge

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

from dateutil.relativedelta import relativedelta

In [2]:
train = pd.read_csv("../data/wikipedia3/wikipedia_train3.csv", parse_dates = ['date'])
test = pd.read_csv("../data/wikipedia3/wikipedia_test3.csv", parse_dates = ['date'])

In [3]:
train.sample(3)

Unnamed: 0,Page,date,Visits
774311,Wikipedia_de.wikipedia.org_desktop_all-agents,2016-03-06,5088.0
1019461,Lyle_and_Erik_Menendez_en.wikipedia.org_all-ac...,2016-03-27,2665.0
1271216,Illuminati_en.wikipedia.org_mobile-web_all-agents,2016-04-17,8448.0


**1.** Take a careful look at the train and test datasets. Note how they differ and how they depend on each other. **Using _only_ the train set**, create a holdout validation set with any type of split you think is useful here. Which split type are you using? Answer in the Google form.

**2.** Write code to compare the scores on your validation set and on the test set. For scoring, use the SMAPE metric (the code is in the lecture). For prediction, use the median of the previous 15 days. Write your validation score in the Google form.

#### SMAPE METRIC

In [4]:
def SMAPE(forecast, actual):
    if forecast.size != actual.size:
        raise ValueError('Forecast and actual data have different dimensions: F.size = {}, A.size = {}'.format(forecast.size, actual.size))
    
    pure_forecast = forecast.fillna(0)
    
    error = (pure_forecast - actual).abs()
    average = (pure_forecast.abs() + actual.abs())/2
    smape_series = error/average
    
    smape_series.fillna(0, inplace = True)
    return 100*smape_series.mean()
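
As a quick sanity check, the metric can be evaluated by hand on a tiny example. The sketch below re-implements the same formula under a hypothetical name, `smape_toy`, with made-up numbers. It illustrates two edge cases: a missing forecast is treated as 0, and a 0/0 term counts as zero error.

```python
import pandas as pd

def smape_toy(forecast: pd.Series, actual: pd.Series) -> float:
    """Same formula as SMAPE above, repeated so the example is self-contained."""
    forecast = forecast.fillna(0)                    # missing forecast -> 0
    error = (forecast - actual).abs()
    average = (forecast.abs() + actual.abs()) / 2
    ratio = (error / average).fillna(0)              # 0/0 -> NaN -> 0
    return 100 * ratio.mean()

# Made-up values: one ordinary row, one 0/0 row, one missing forecast.
toy_forecast = pd.Series([100.0, 0.0, None])
toy_actual = pd.Series([50.0, 0.0, 10.0])

toy_score = smape_toy(toy_forecast, toy_actual)
print(round(toy_score, 2))
```

The three rows contribute 50/75, 0 and 10/5, so the mean is 8/9 and the score is about 88.9. A single missing forecast already costs the maximum per-row value of 200.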

#### Time range

In [5]:
min_train_date = train['date'].min()
max_train_date = train['date'].max()

min_test_date = test['date'].min()
max_test_date = test['date'].max()

print('TRAIN: From {} to {}'.format(min_train_date.date(), max_train_date.date()))
print('TEST: From {} to {}'.format(min_test_date.date(), max_test_date.date()))

TRAIN: From 2016-01-01 to 2016-08-31
TEST: From 2016-09-10 to 2016-11-10


#### Holdout validation

Since we predict with the median of the last N days and the test set covers a two-month period after the train data, it is reasonable to split the data into 4 parts:

+ a holdout covering the last two months
+ a gap of one month and 10 days, which is dropped
+ a train set with the last N days before the gap
+ the remaining earlier data, which is dropped
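
The four-part split listed above can be sketched on synthetic data. This is only an illustration under assumptions: `toy`, `gap_start` and the other names are hypothetical, the data is a single made-up page, and the gap is dropped explicitly here.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Synthetic daily data for one hypothetical page over the train date range.
dates = pd.date_range("2016-01-01", "2016-08-31", freq="D")
toy = pd.DataFrame({"Page": "toy_page", "date": dates,
                    "Visits": list(range(len(dates)))})

n_days = 15
gap = relativedelta(months=1, days=10)

holdout_start = toy["date"].max() - relativedelta(months=2)  # 2016-06-30
gap_start = holdout_start - gap                              # 2016-05-20
train_start = gap_start - relativedelta(days=n_days)         # 2016-05-05

toy_holdout = toy[toy["date"] >= holdout_start]
# Only the N days right before the gap are kept for the median.
toy_train = toy[(toy["date"] >= train_start) & (toy["date"] < gap_start)]

print(len(toy_train), len(toy_holdout))
```

Everything before `train_start` and everything inside the gap is simply left out of both frames.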

Split into holdout and train only:

In [6]:
last_N_days = 15

In [7]:
holdout_date = max_train_date - relativedelta(months = 2)
train_holdout_gap = relativedelta(months = 1, days = 10)
train_date = holdout_date - train_holdout_gap - relativedelta(days = last_N_days)

holdout_indices = train.date >= holdout_date
train_indices = (train_date <= train.date) & (train.date < holdout_date)

holdout = train[holdout_indices]
train_pure = train[train_indices]

In [8]:
def get_prediction(train, test):
    # Per-page median of Visits, merged onto the test rows as 'prediction'
    medians = train.groupby('Page')['Visits'].median().rename('prediction').reset_index()
    test_prediction = test.merge(medians, on = 'Page', how = 'left')
    
    return test_prediction

In [9]:
def get_prediction_SMAPE(train, test):
    test_prediction = get_prediction(train, test)
    return SMAPE(test_prediction['prediction'], test_prediction['Visits'])
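
To make the median-and-merge step concrete, here is a minimal sketch on made-up data. The frames `history` and `future` are hypothetical; the point is that a page absent from the history ends up with a missing prediction, which SMAPE then treats as a forecast of 0.

```python
import pandas as pd

# Made-up history and future rows; page "c" has no history.
history = pd.DataFrame({
    "Page": ["a", "a", "a", "b", "b"],
    "Visits": [10.0, 20.0, 90.0, 5.0, 7.0],
})
future = pd.DataFrame({"Page": ["a", "b", "c"], "Visits": [25.0, 6.0, 3.0]})

# One median per page, then a left merge so every future row is kept.
medians = (history.groupby("Page")["Visits"]
           .median()
           .rename("prediction")
           .reset_index())
merged = future.merge(medians, on="Page", how="left")

print(merged)
```

The median for page "a" is 20 even though one day had 90 visits, which is why the median is a robust baseline for spiky traffic.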

In [10]:
holdout_smape = get_prediction_SMAPE(train_pure, holdout)
print('SMAPE for holdout data over the last 2 months and train data over preceding 15 days: {}'.format(holdout_smape))

SMAPE for holdout data over the last 2 months and train data over preceding 15 days: 44.934277560873


#### Test validation

In [11]:
train_date = max_train_date - relativedelta(days = 15)
train_indices = train_date <= train.date
train_pure = train[train_indices]

In [12]:
test_smape = get_prediction_SMAPE(train_pure, test)
print('SMAPE for test data over the next 2 months and train data over last 15 days: {}'.format(test_smape))

SMAPE for test data over the next 2 months and train data over last 15 days: 40.207352322566344


**3.** Perform K-fold validation using your type of split. Run GridSearch with any classifier you like and a set of parameters to optimize, providing it with your custom validation splits. Compare the scores on your validation and test sets. For scoring, again use the SMAPE metric. Write your validation and test scores in the Google form.

#### Choose the best 'last N' parameter using a floating window, 5 folds

In [13]:
train_window = 2 #months
holdout_window = 2 #months

folds_number = 5
folds_period = relativedelta(max_train_date, min_train_date) - relativedelta(months = train_window + holdout_window) - train_holdout_gap
window_step = (folds_period.days + folds_period.months * 30) // folds_number
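
The step arithmetic can be checked by hand. The sketch below repeats the calculation with plain `datetime.date` values and hypothetical names (`span`, `slack`, `step_days`); like the cell above, it approximates a month as 30 days.

```python
from datetime import date
from dateutil.relativedelta import relativedelta

first_day, last_day = date(2016, 1, 1), date(2016, 8, 31)
gap = relativedelta(months=1, days=10)

span = relativedelta(last_day, first_day)           # 7 months, 30 days
slack = span - relativedelta(months=2 + 2) - gap    # 2 months, 20 days
step_days = (slack.days + slack.months * 30) // 5   # (20 + 60) // 5

# Start date of each of the 5 folds.
fold_starts = [first_day + relativedelta(days=step_days * i) for i in range(5)]
print(step_days, fold_starts)
```

With a 16-day step the last fold starts on 2016-03-05, so its holdout still ends before the last train date.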

In [14]:
def create_validation(df, start_date):
    holdout_start = start_date + relativedelta(months = train_window) + train_holdout_gap
    holdout_end = holdout_start + relativedelta(months = holdout_window)
    
    if holdout_end > max_train_date:
        raise ValueError('start_date + train_window + gap + holdout_window must not exceed max_train_date. We want to use floating window with train size {} months and holdout size {} months.'.format(train_window, holdout_window))
 
    train_indices = (df.date >= start_date) & (df.date < holdout_start)
    holdout_indices = (df.date >= holdout_start) & (df.date < holdout_end)

    return train_indices, holdout_indices


In [15]:
train_start_dates = [min_train_date + relativedelta(days = window_step * i) for i in range(folds_number)]

In [16]:
print('Cross validation from {} to {}:'.format(min_train_date.date(), max_train_date.date()))
for i, train_start in enumerate(train_start_dates):
    train_end = train_start + relativedelta(months = train_window)
    holdout_start = train_end + train_holdout_gap
    holdout_end = holdout_start + relativedelta(months = holdout_window)
    print('{}.Train: from {} to {},'.format(i, train_start.date(), train_end.date()))
    print('  Holdout: from {} to {}'.format(holdout_start.date(), holdout_end.date()))

Cross validation from 2016-01-01 to 2016-08-31:
0.Train: from 2016-01-01 to 2016-03-01,
  Holdout: from 2016-04-11 to 2016-06-11
1.Train: from 2016-01-17 to 2016-03-17,
  Holdout: from 2016-04-27 to 2016-06-27
2.Train: from 2016-02-02 to 2016-04-02,
  Holdout: from 2016-05-12 to 2016-07-12
3.Train: from 2016-02-18 to 2016-04-18,
  Holdout: from 2016-05-28 to 2016-07-28
4.Train: from 2016-03-05 to 2016-05-05,
  Holdout: from 2016-06-15 to 2016-08-15


In [17]:
CVIterator = []
for i in train_start_dates:
    train_indices, holdout_indices = create_validation(train, i)
    CVIterator.append((train_indices, holdout_indices))

In [18]:
test_last_N_days = [1, 2, 3, 5, 8, 13, 21, 34, 55]

result = {}
for i, N in enumerate(test_last_N_days):
    smapes = []
    for train_indices, holdout_indices in CVIterator:
        current_train = train[train_indices]
        current_holdout = train[holdout_indices]
        
        actual_N_train_start = current_train.date.max() - relativedelta(days = N)
        actual_N_train = current_train[current_train.date >= actual_N_train_start]
        
        current_smape = get_prediction_SMAPE(actual_N_train, current_holdout)
        smapes.append(current_smape)
    
    smapes_series = pd.Series(smapes)
    print('Mean SMAPE for last {} days: {}'.format(N, smapes_series.mean()))
    result[N] = smapes_series

Mean SMAPE for last 1 days: 42.83186996075777
Mean SMAPE for last 2 days: 41.88165552374041
Mean SMAPE for last 3 days: 41.575264213769145
Mean SMAPE for last 5 days: 40.8600628446685
Mean SMAPE for last 8 days: 41.50156879388365
Mean SMAPE for last 13 days: 40.985370072241366
Mean SMAPE for last 21 days: 41.75113354602299
Mean SMAPE for last 34 days: 42.73413688065837
Mean SMAPE for last 55 days: 44.19190516427825


The best mean SMAPE is obtained with the last 5 days.
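
The choice can also be read off programmatically. This small sketch copies the mean SMAPE values printed above into a hypothetical `mean_smape` series and picks the minimum; in the notebook itself the same could be done from the `result` dictionary.

```python
import pandas as pd

# Mean SMAPE per window length N, copied from the output above.
mean_smape = pd.Series({
    1: 42.83186996075777,
    2: 41.88165552374041,
    3: 41.575264213769145,
    5: 40.8600628446685,
    8: 41.50156879388365,
    13: 40.985370072241366,
    21: 41.75113354602299,
    34: 42.73413688065837,
    55: 44.19190516427825,
})

best_n = mean_smape.idxmin()                # window length with the lowest mean SMAPE
runner_up = mean_smape.drop(best_n).idxmin()
margin = mean_smape[runner_up] - mean_smape[best_n]
print(best_n, runner_up, round(margin, 3))
```

The runner-up is N = 13, only about 0.13 SMAPE points behind, so the curve is fairly flat between 5 and 13 days.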

In [19]:
last_N_days = 5
last_N_days_train = train[train.date >= max_train_date - relativedelta(days = last_N_days)]
test_smape = get_prediction_SMAPE(last_N_days_train, test)
print('SMAPE for test data for the next 2 months using train data over the last {} days: {}'.format(last_N_days, test_smape))

SMAPE for test data for the next 2 months using train data over the last 5 days: 40.14220878443605


As we can see, the last 5 days give an even lower test SMAPE (40.14) than the last 15 days (40.21) when predicting visits for the next 2 months. Because this hyperparameter was chosen by cross-validation, it should be more robust to overfitting.

#### Use linear regression

In [20]:
def get_language(page):
    # Escape the dots so they match literal '.' characters only
    res = re.search(r'[a-z][a-z]\.wikipedia\.org', page)
    if res:
        return res.group(0)[0:2]
    return 'na'

In [21]:
def transform(x):
    x['lang'] = get_language(x['Page'])
    x['isweekend'] = 1 if x.date.weekday() >= 5 else 0
    
    return x
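
Row-wise `apply` is slow on a few million rows. As an alternative sketch (not the approach used in this notebook), the same two features can be computed with vectorized pandas operations. The frame `pages` is made up; its third page name is hypothetical and only shows the `'na'` fallback.

```python
import pandas as pd

pages = pd.DataFrame({
    "Page": [
        "Wikipedia_de.wikipedia.org_desktop_all-agents",
        "Illuminati_en.wikipedia.org_mobile-web_all-agents",
        "toy_page_commons.wikimedia.org_all-access_spider",  # hypothetical
    ],
    "date": pd.to_datetime(["2016-03-06", "2016-04-17", "2016-04-18"]),
})

features = pd.DataFrame({
    # Two-letter language code before ".wikipedia.org", or 'na' if absent.
    "lang": pages["Page"]
        .str.extract(r"([a-z]{2})\.wikipedia\.org", expand=False)
        .fillna("na"),
    # Saturday is weekday 5 and Sunday is 6.
    "isweekend": (pages["date"].dt.weekday >= 5).astype(int),
})
print(features)
```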

In [22]:
X_train_transformed = train.apply(transform, axis = 'columns').drop(['date', 'Page', 'Visits'], axis = 'columns')

X_train = pd.get_dummies(X_train_transformed)
y_train = train['Visits']

In [23]:
X_test_transformed = test.apply(transform, axis = 'columns').drop(['date', 'Page', 'Visits'], axis = 'columns')

X_test = pd.get_dummies(X_test_transformed)
y_test = test['Visits']
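
Calling `pd.get_dummies` separately on train and test yields matching columns only if both sets contain the same languages. A minimal sketch of one way to guard against a mismatch, on made-up frames with hypothetical names, is to reindex the test columns to the train columns:

```python
import pandas as pd

# Made-up features: "ru" and "de" appear only in train, "ja" only in test.
train_feats = pd.DataFrame({"isweekend": [0, 1, 0], "lang": ["en", "de", "ru"]})
test_feats = pd.DataFrame({"isweekend": [1, 0], "lang": ["en", "ja"]})

train_X = pd.get_dummies(train_feats)
# Keep exactly the train columns: unseen test categories are dropped,
# categories missing from test are added as all-zero columns.
test_X = pd.get_dummies(test_feats).reindex(columns=train_X.columns, fill_value=0)

print(list(test_X.columns))
```

Whether dropping an unseen category is acceptable depends on the model; here it just keeps the feature matrices compatible.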

In [24]:
X_train.sample(5)

Unnamed: 0,isweekend,lang_de,lang_en,lang_es,lang_fr,lang_ja,lang_na,lang_ru,lang_zh
2087522,1,0,1,0,0,0,0,0,0
2443888,0,0,0,1,0,0,0,0,0
2722790,0,0,1,0,0,0,0,0,0
2729837,0,0,1,0,0,0,0,0,0
1044285,0,0,1,0,0,0,0,0,0


In [25]:
def map_bool_iterator(train_bool_indices, holdout_bool_indices):
    new_train_indices = train_bool_indices[train_bool_indices].index
    new_holdout_indices = holdout_bool_indices[holdout_bool_indices].index

    return new_train_indices, new_holdout_indices
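
scikit-learn interprets the arrays in a custom `cv` list as positional row indices. The helper above returns index labels, which coincide with positions only when the frame has the default 0..n-1 index (as `train` does after `read_csv`). The sketch below, on a made-up mask, shows the difference and a label-independent alternative using `np.flatnonzero`.

```python
import numpy as np
import pandas as pd

# Made-up boolean mask on a frame whose index does not start at 0.
mask = pd.Series([True, False, True, True], index=[10, 11, 12, 13])

labels = mask[mask].index.to_numpy()          # index labels of the True rows
positions = np.flatnonzero(mask.to_numpy())   # row positions of the True rows

print(labels, positions)
```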

In [26]:
index_CVIterator = []

for train_indices, holdout_indices in CVIterator:
    new_train_indices, new_holdout_indices = map_bool_iterator(train_indices, holdout_indices)
    index_CVIterator.append((new_train_indices, new_holdout_indices))

In [27]:
def scorer_function(y_true, y_pred):
    pred = pd.Series(y_pred).reset_index(drop = True)
    act = pd.Series(y_true).reset_index(drop = True)
    smape = SMAPE(pred, act)
    return smape

scorer = make_scorer(scorer_function, greater_is_better = False)
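
With `greater_is_better = False`, scikit-learn negates the metric so that larger is always better; the validation SMAPE is therefore `-best_score_`. The sketch below shows this on random made-up data with hypothetical names (`smape_np`, `splits`, `search`). It is a small stand-in, not the notebook's actual search.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def smape_np(y_true, y_pred):
    """SMAPE on plain arrays; 0/0 terms count as zero error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    ratio = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100 * ratio.mean()

# Made-up regression data, ordered in "time".
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 50 + 10 * X[:, 0] + rng.normal(size=60)

# Explicit (train_positions, holdout_positions) pairs with a gap in between.
splits = [(np.arange(0, 20), np.arange(25, 35)),
          (np.arange(10, 30), np.arange(35, 45)),
          (np.arange(20, 40), np.arange(45, 55))]

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1, 10]}, cv=splits,
                      scoring=make_scorer(smape_np, greater_is_better=False))
search.fit(X, y)

best_smape = -search.best_score_   # undo the sign flip
print(search.best_params_, best_smape)
```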

In [49]:
regressor = Ridge()

grid = {
    'alpha': [0.1, 1, 3, 5, 10, 50]
}
grid_search = GridSearchCV(
    regressor, 
    param_grid = grid, 
    cv = index_CVIterator, 
    scoring = scorer,
    n_jobs = os.cpu_count()//2,
    return_train_score = True
)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=[(Int64Index([      0,       1,       2,       3,       4,       5,       6,
                  7,       8,       9,
            ...
            1193305, 1193306, 1193307, 1193308, 1193309, 1193310, 1193311,
            1193312, 1193313, 1193314],
           dtype='int64', length=1193315), Int64In...2682000, 2682001,
            2682002, 2682003, 2682004],
           dtype='int64', length=720715))],
       error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=4,
       param_grid={'alpha': [0.1, 1, 3, 5, 10, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(scorer_function, greater_is_better=False),
       verbose=0)

In [50]:
prediction = pd.Series(grid_search.best_estimator_.predict(X_test))

In [51]:
test_regression_smape = SMAPE(prediction, y_test)
print('SMAPE for test data for the next 2 months using Ridge regression: {}'.format(test_regression_smape))

SMAPE for test data for the next 2 months using Ridge regression: 112.22684934380054


In [52]:
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)

prediction = pd.Series(linear_regressor.predict(X_test))

test_regression_smape = SMAPE(prediction, y_test)
print('SMAPE for test data for the next 2 months using Linear regression: {}'.format(test_regression_smape))

SMAPE for test data for the next 2 months using Linear regression: 112.22621967099843


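Linear models perform poorly here: both Ridge and plain linear regression reach a test SMAPE of about 112, compared with about 40 for the median baseline. With only the language and a weekend flag as features, the model cannot tell pages of the same language apart.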
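
One plausible contributor (an assumption, not something measured in this notebook) is that a least-squares fit on indicator features predicts something close to a group mean, while page views are heavily skewed. A toy sketch with made-up numbers and hypothetical names shows how a constant mean prediction can score much worse on SMAPE than a constant median prediction:

```python
import numpy as np

def smape_const(prediction: float, actual: np.ndarray) -> float:
    """SMAPE of a single constant prediction against positive actual values."""
    error = np.abs(prediction - actual)
    average = (np.abs(prediction) + np.abs(actual)) / 2
    return 100 * np.mean(error / average)

# Made-up skewed "visits": four small pages and one very popular page.
visits = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

smape_mean = smape_const(visits.mean(), visits)        # prediction = 202
smape_median = smape_const(np.median(visits), visits)  # prediction = 3

print(round(smape_mean, 1), round(smape_median, 1))
```

The mean is pulled toward the one popular page and misses the other four by a wide relative margin, while the median stays close to most pages.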