<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Description" data-toc-modified-id="Data-Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Description</a></span></li><li><span><a href="#Useful-Scripts" data-toc-modified-id="Useful-Scripts-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Useful Scripts</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Memory-Reduction" data-toc-modified-id="Memory-Reduction-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Memory Reduction</a></span></li><li><span><a href="#Take-Most-Visited-Page-as-Timeseries" data-toc-modified-id="Take-Most-Visited-Page-as-Timeseries-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Take Most Visited Page as Timeseries</a></span></li><li><span><a href="#Add-lag-columns" data-toc-modified-id="Add-lag-columns-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Add lag columns</a></span></li><li><span><a href="#Add-bias-term" data-toc-modified-id="Add-bias-term-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Add bias term</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Modelling</a></span></li><li><span><a href="#Train-Test-split" data-toc-modified-id="Train-Test-split-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Train Test split</a></span><ul class="toc-item"><li><span><a href="#Modelling:-Ensemble-Regressors" data-toc-modified-id="Modelling:-Ensemble-Regressors-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Modelling: Ensemble Regressors</a></span></li></ul></li></ul></div>

# Data Description

Reference: https://www.kaggle.com/c/web-traffic-time-series-forecasting/data

I cleaned the Kaggle Wikipedia web-traffic data and kept only the 2016 data, sampled with a fraction of 0.1.

The data was then melted into long format (one row per page per date), and additional date and page-metadata columns were derived.
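
The melting step itself is not shown in this notebook. The sketch below is one way it might look, assuming the raw Kaggle layout of one row per page and one column per date. The toy frame, the name `long_df`, and the choice of derived columns are illustrative assumptions, not the actual cleaning script.

```python
import pandas as pd

# Toy stand-in for the wide file: one row per page, one column per date.
wide = pd.DataFrame({
    'Page': ['A_en.wikipedia.org_desktop_all-agents',
             'B_fr.wikipedia.org_mobile-web_all-agents'],
    '2016-01-01': [10, 3],
    '2016-01-02': [12, 5],
})

# Wide -> long: one row per (Page, date) pair.
long_df = wide.melt(id_vars='Page', var_name='date', value_name='visits')
long_df['date'] = pd.to_datetime(long_df['date'])

# Calendar features derived from the date.
long_df['month'] = long_df['date'].dt.month
long_df['dayofweek'] = long_df['date'].dt.dayofweek   # Monday=0 ... Sunday=6
long_df['weekend'] = long_df['dayofweek'] >= 5

print(long_df)
```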

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(context='notebook', style='whitegrid', rc={'figure.figsize': (12,8)})
plt.style.use('fivethirtyeight') # better than sns styles.
matplotlib.rcParams['figure.figsize'] = 12,8

import os
import time

# random state
random_state=100
np.random.seed(random_state)

# Jupyter notebook settings for pandas
#pd.set_option('display.float_format', '{:,.2g}'.format) # numbers sep by comma
from pandas.api.types import CategoricalDtype
np.set_printoptions(precision=3)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100) # None for all the rows
pd.set_option('display.max_colwidth', 200)

import IPython
from IPython.display import display, HTML, Image, Markdown

print([(x.__name__,x.__version__) for x in [np, pd,sns,matplotlib]])

[('numpy', '1.16.4'), ('pandas', '0.25.0'), ('seaborn', '0.9.0'), ('matplotlib', '3.1.1')]


In [2]:
import dask
import dask.dataframe as dd
import gc

In [3]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;


In [21]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import ExtraTreesRegressor

In [22]:
from sklearn.metrics import mean_absolute_error, r2_score

# Useful Scripts

In [4]:
def show_method_attributes(method, ncols=7,start=None):
    """Show the public attributes of an object in a table with ncols columns.

    If start is given, keep only the names that begin with it.

    Example:
    ========
    show_method_attributes(list)
    """
    # skip private and dunder names
    x = [I for I in dir(method) if I[0]!='_' ]
    # skip names of commonly imported modules
    x = [I for I in x 
         if I not in 'os np pd sys time psycopg2'.split() ]
    if start:
        x = [I for I in x if I.startswith(start)]

    return pd.DataFrame(np.array_split(x,ncols)).T.fillna('')

# Load the data

In [5]:
df = pd.read_csv('../../data/wiki/processed/data_cleaned_2016_frac01.csv',
                 parse_dates=['date'])

print(df.shape) # 5.3 million rows, 21 cols
df.head()

(5309196, 21)


,Page,date,visits,year,month,day,quarter,dayofweek,dayofyear,day_name,month_name,weekend,weekday,mean,median,name,project,access,agent,lang,language
0,Sean_Connery_en.wikipedia.org_desktop_all-agents,2016-01-01,4872,2016,1,1,1,4,1,Friday,January,False,True,3405.661202,2624.0,Sean_Connery,en.wikipedia.org,desktop,all-agents,en,English
1,Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008_fr.wikipedia.org_desktop_all-agents,2016-01-01,6,2016,1,1,1,4,1,Friday,January,False,True,170.84153,18.0,Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008,fr.wikipedia.org,desktop,all-agents,fr,French
2,The_Undertaker_fr.wikipedia.org_mobile-web_all-agents,2016-01-01,469,2016,1,1,1,4,1,Friday,January,False,True,400.336066,345.5,The_Undertaker,fr.wikipedia.org,mobile-web,all-agents,fr,French
3,Category:Outdoor_sex_commons.wikimedia.org_all-access_all-agents,2016-01-01,142,2016,1,1,1,4,1,Friday,January,False,True,205.174863,193.0,Category:Outdoor_sex,commons.wikimedia.org,all-access,all-agents,commons,Media
4,Камызяк_ru.wikipedia.org_all-access_all-agents,2016-01-01,6692,2016,1,1,1,4,1,Friday,January,False,True,912.516393,559.0,Камызяк,ru.wikipedia.org,all-access,all-agents,ru,Russian


# Memory Reduction

In [6]:
df.dtypes

Page                  object
date          datetime64[ns]
visits                 int64
year                   int64
month                  int64
day                    int64
quarter                int64
dayofweek              int64
dayofyear              int64
day_name              object
month_name            object
weekend                 bool
weekday                 bool
mean                 float64
median               float64
name                  object
project               object
access                object
agent                 object
lang                  object
language              object
dtype: object

In [7]:
df.memory_usage(deep=True).sum() * 1e-6 # MB

4069.15316

In [8]:
# every row is from 2016, so the year column carries no information; drop it

df.drop('year',axis=1,inplace=True)

In [9]:
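cols_int = ['visits']
cols_cat = ['month','day','quarter','day_name','month_name',
            'project','access','agent','language']
cols_float = ['mean','median']

# the largest visit count (about 16.6 million) fits comfortably in int32
for c in cols_int:
    df[c] = df[c].astype(np.int32)

# float32 keeps about 7 significant digits
for c in cols_float:
    df[c] = df[c].astype(np.float32)

# low-cardinality columns are stored as integer codes plus a small lookup table
for c in cols_cat:
    df[c] = df[c].astype('category')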

In [10]:
df.memory_usage(deep=True).sum() * 1e-6 # MB

1777.2233549999999
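
To see which columns account for the saving, the memory can be inspected per column instead of summed. The following is a small sketch on a made-up frame (the `toy` data and the `report` table are assumptions for illustration, not values from this notebook):

```python
import numpy as np
import pandas as pd

# Toy frame: a repetitive string column and a 64-bit integer column.
toy = pd.DataFrame({
    'day_name': ['Friday', 'Saturday', 'Sunday'] * 1000,
    'visits': np.arange(3000, dtype=np.int64),
})

before = toy.memory_usage(deep=True)

toy['day_name'] = toy['day_name'].astype('category')  # few distinct values
toy['visits'] = toy['visits'].astype(np.int32)        # values fit in 32 bits

after = toy.memory_usage(deep=True)

# Per-column memory in MB, before and after the conversion.
report = pd.DataFrame({'before_MB': before * 1e-6, 'after_MB': after * 1e-6})
print(report)
```

Columns left as `object` (here `Page`, `name` and `lang`) are likely to dominate what remains, since every row stores a full Python string.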

# Take Most Visited Page as Timeseries

In [11]:
df.head()

,Page,date,visits,month,day,quarter,dayofweek,dayofyear,day_name,month_name,weekend,weekday,mean,median,name,project,access,agent,lang,language
0,Sean_Connery_en.wikipedia.org_desktop_all-agents,2016-01-01,4872,1,1,1,4,1,Friday,January,False,True,3405.661133,2624.0,Sean_Connery,en.wikipedia.org,desktop,all-agents,en,English
1,Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008_fr.wikipedia.org_desktop_all-agents,2016-01-01,6,1,1,1,4,1,Friday,January,False,True,170.841537,18.0,Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008,fr.wikipedia.org,desktop,all-agents,fr,French
2,The_Undertaker_fr.wikipedia.org_mobile-web_all-agents,2016-01-01,469,1,1,1,4,1,Friday,January,False,True,400.33606,345.5,The_Undertaker,fr.wikipedia.org,mobile-web,all-agents,fr,French
3,Category:Outdoor_sex_commons.wikimedia.org_all-access_all-agents,2016-01-01,142,1,1,1,4,1,Friday,January,False,True,205.174866,193.0,Category:Outdoor_sex,commons.wikimedia.org,all-access,all-agents,commons,Media
4,Камызяк_ru.wikipedia.org_all-access_all-agents,2016-01-01,6692,1,1,1,4,1,Friday,January,False,True,912.516418,559.0,Камызяк,ru.wikipedia.org,all-access,all-agents,ru,Russian


In [12]:
# for each language, the row with the highest single-day visit count
df.groupby('language')['visits'].apply(lambda x: df.loc[x.nlargest(1).index])

language,,Page,date,visits,month,day,quarter,dayofweek,dayofyear,day_name,month_name,weekend,weekday,mean,median,name,project,access,agent,lang,language
Chinese,3526717,緋彈的亞莉亞角色列表_zh.wikipedia.org_desktop_all-agents,2016-08-31,243557,8,31,3,2,244,Wednesday,August,False,True,813.0765,130.0,緋彈的亞莉亞角色列表,zh.wikipedia.org,desktop,all-agents,zh,Chinese
English,2714919,Special:Search_en.wikipedia.org_desktop_all-agents,2016-07-06,16592075,7,6,3,2,188,Wednesday,July,False,True,1845918.0,1700576.5,Special:Search,en.wikipedia.org,desktop,all-agents,en,English
French,2163034,Wikipédia:Accueil_principal_fr.wikipedia.org_all-access_all-agents,2016-05-29,1845404,5,29,2,6,150,Sunday,May,True,False,1588652.0,1601521.0,Wikipédia:Accueil_principal,fr.wikipedia.org,all-access,all-agents,fr,French
German,4439792,Gerätestecker_de.wikipedia.org_desktop_all-agents,2016-11-02,558381,11,2,4,2,307,Wednesday,November,False,True,1870.743,398.0,Gerätestecker,de.wikipedia.org,desktop,all-agents,de,German
Japanese,2724137,デイヴィッド・ロックフェラー_ja.wikipedia.org_all-access_all-agents,2016-07-06,1651272,7,6,3,2,188,Wednesday,July,False,True,8011.582,143.0,デイヴィッド・ロックフェラー,ja.wikipedia.org,all-access,all-agents,ja,Japanese
Media,1865460,Parsoid/Developer_Setup_www.mediawiki.org_all-access_all-agents,2016-05-08,927825,5,8,2,6,129,Sunday,May,True,False,12753.89,60.0,Parsoid/Developer_Setup,www.mediawiki.org,all-access,all-agents,www,Media
Russian,4179962,Служебная:Поиск_ru.wikipedia.org_all-access_all-agents,2016-10-15,1412292,10,15,4,5,289,Saturday,October,True,False,179811.9,171580.0,Служебная:Поиск,ru.wikipedia.org,all-access,all-agents,ru,Russian
Spanish,2376682,Nilo_es.wikipedia.org_desktop_all-agents,2016-06-12,783454,6,12,2,6,164,Sunday,June,True,False,2780.18,628.5,Nilo,es.wikipedia.org,desktop,all-agents,es,Spanish
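
The `apply` with `nlargest(1)` above picks, for each language, the single row with the highest daily count. A possibly simpler way to express the same selection is `groupby(...).idxmax()` followed by one `.loc` lookup. This is a sketch on a made-up frame (`toy` and `top_rows` are hypothetical names), and the result has a flat index rather than the two-level index shown above:

```python
import pandas as pd

toy = pd.DataFrame({
    'language': ['English', 'English', 'French', 'French'],
    'Page': ['p1', 'p2', 'p3', 'p4'],
    'visits': [10, 50, 7, 3],
})

# Index label of the max-visits row within each language, then select those rows.
top_rows = toy.loc[toy.groupby('language')['visits'].idxmax()]
print(top_rows)
```

Note that this ranks individual days, not pages: a page with one extreme spike can win even if its total traffic is small.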


In [13]:
idx = df.groupby('Page')['visits'].sum().idxmax()
idx

'Special:Search_en.wikipedia.org_desktop_all-agents'

In [14]:
df.query(""" Page == @idx """).head()

,Page,date,visits,month,day,quarter,dayofweek,dayofyear,day_name,month_name,weekend,weekday,mean,median,name,project,access,agent,lang,language
2297,Special:Search_en.wikipedia.org_desktop_all-agents,2016-01-01,1401667,1,1,1,4,1,Friday,January,False,True,1845918.125,1700576.5,Special:Search,en.wikipedia.org,desktop,all-agents,en,English
16803,Special:Search_en.wikipedia.org_desktop_all-agents,2016-01-02,1395136,1,2,1,5,2,Saturday,January,True,False,1845918.125,1700576.5,Special:Search,en.wikipedia.org,desktop,all-agents,en,English
31309,Special:Search_en.wikipedia.org_desktop_all-agents,2016-01-03,1455522,1,3,1,6,3,Sunday,January,True,False,1845918.125,1700576.5,Special:Search,en.wikipedia.org,desktop,all-agents,en,English
45815,Special:Search_en.wikipedia.org_desktop_all-agents,2016-01-04,1750373,1,4,1,0,4,Monday,January,False,True,1845918.125,1700576.5,Special:Search,en.wikipedia.org,desktop,all-agents,en,English
60321,Special:Search_en.wikipedia.org_desktop_all-agents,2016-01-05,1787494,1,5,1,1,5,Tuesday,January,False,True,1845918.125,1700576.5,Special:Search,en.wikipedia.org,desktop,all-agents,en,English


In [15]:
df.columns

Index(['Page', 'date', 'visits', 'month', 'day', 'quarter', 'dayofweek',
       'dayofyear', 'day_name', 'month_name', 'weekend', 'weekday', 'mean',
       'median', 'name', 'project', 'access', 'agent', 'lang', 'language'],
      dtype='object')

In [16]:
cols_drop = ['Page','day_name', 'month_name',  'weekday',
             'mean', 'median','dayofyear',
             'name', 'project', 'access', 'agent', 'lang', 'language']

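# drop(cols, 1) relies on a positional axis argument that newer pandas rejects; name the columns explicitly
data = df.query(""" Page == @idx """).drop(columns=cols_drop).set_index('date')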

print(data.shape)
data.head()

(366, 6)


date,visits,month,day,quarter,dayofweek,weekend
2016-01-01,1401667,1,1,1,4,False
2016-01-02,1395136,1,2,1,5,True
2016-01-03,1455522,1,3,1,6,True
2016-01-04,1750373,1,4,1,0,False
2016-01-05,1787494,1,5,1,1,False


# Add lag columns

In [17]:
for lag in range(1,8):
    data['lag'+str(lag)] = data['visits'].shift(lag)

In [18]:
data.head()

date,visits,month,day,quarter,dayofweek,weekend,lag1,lag2,lag3,lag4,lag5,lag6,lag7
2016-01-01,1401667,1,1,1,4,False,,,,,,,
2016-01-02,1395136,1,2,1,5,True,1401667.0,,,,,,
2016-01-03,1455522,1,3,1,6,True,1395136.0,1401667.0,,,,,
2016-01-04,1750373,1,4,1,0,False,1455522.0,1395136.0,1401667.0,,,,
2016-01-05,1787494,1,5,1,1,False,1750373.0,1455522.0,1395136.0,1401667.0,,,


In [19]:
data = data.dropna()
data.head()

date,visits,month,day,quarter,dayofweek,weekend,lag1,lag2,lag3,lag4,lag5,lag6,lag7
2016-01-08,1804425,1,8,1,4,False,1972186.0,1952324.0,1787494.0,1750373.0,1455522.0,1395136.0,1401667.0
2016-01-09,1483316,1,9,1,5,True,1804425.0,1972186.0,1952324.0,1787494.0,1750373.0,1455522.0,1395136.0
2016-01-10,1576497,1,10,1,6,True,1483316.0,1804425.0,1972186.0,1952324.0,1787494.0,1750373.0,1455522.0
2016-01-11,1959763,1,11,1,0,False,1576497.0,1483316.0,1804425.0,1972186.0,1952324.0,1787494.0,1750373.0
2016-01-12,1903329,1,12,1,1,False,1959763.0,1576497.0,1483316.0,1804425.0,1972186.0,1952324.0,1787494.0
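
The lag construction is easier to follow on a tiny series. The sketch below uses made-up values and only two lags. It assumes the index has one row per day with no gaps, because `shift` moves values by row position, not by calendar date.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50],
              index=pd.date_range('2016-01-01', periods=5, freq='D'),
              name='visits')
toy = s.to_frame()

# shift(k) moves values k rows down, so each row sees the value from k days earlier.
for lag in (1, 2):
    toy[f'lag{lag}'] = toy['visits'].shift(lag)

# The first max(lag) rows have no history and contain NaN; drop them.
toy = toy.dropna()
print(toy)
```

With lags 1 to 7, as in this notebook, the first seven days are dropped, which is why the table above starts on 2016-01-08.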


# Add bias term

In [51]:
data['bias'] = 1
data.head()

date,visits,month,day,quarter,dayofweek,weekend,lag1,lag2,lag3,lag4,lag5,lag6,lag7,bias
2016-01-08,1804425,1,8,1,4,False,1972186.0,1952324.0,1787494.0,1750373.0,1455522.0,1395136.0,1401667.0,1
2016-01-09,1483316,1,9,1,5,True,1804425.0,1972186.0,1952324.0,1787494.0,1750373.0,1455522.0,1395136.0,1
2016-01-10,1576497,1,10,1,6,True,1483316.0,1804425.0,1972186.0,1952324.0,1787494.0,1750373.0,1455522.0,1
2016-01-11,1959763,1,11,1,0,False,1576497.0,1483316.0,1804425.0,1972186.0,1952324.0,1787494.0,1750373.0,1
2016-01-12,1903329,1,12,1,1,False,1959763.0,1576497.0,1483316.0,1804425.0,1972186.0,1952324.0,1787494.0,1


# Modelling

# Train Test split

In [52]:
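# row position of the first test row: the first 70% of days train, the rest test (no shuffling)
frac = int(len(data)*0.7)

# drop('visits', 1) relies on a positional axis argument that newer pandas rejects
Xtrain = data.drop(columns='visits').iloc[:frac, :].astype(np.int32).to_numpy()
Xtest = data.drop(columns='visits').iloc[frac:, :].astype(np.int32).to_numpy()

ytrain = data['visits'].iloc[:frac].to_numpy()
ytest = data['visits'].iloc[frac:].to_numpy()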

Xtrain[0], Xtest[0], ytrain[0], ytest[0],ytrain[-1],ytest[-1]

(array([      1,       8,       1,       4,       0, 1972186, 1952324,
        1787494, 1750373, 1455522, 1395136, 1401667,       1], dtype=int32),
 array([      9,      15,       3,       3,       0, 1728053, 1814001,
        1731404, 1464487, 1408230, 1633931, 1758004,       1], dtype=int32),
 1804425,
 1729568,
 1728053,
 1175657)
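
The split above is chronological: the earliest 70% of rows train the model and the latest 30% test it, so no future value leaks into training. A minimal sketch of the same idea as a reusable helper, on made-up data (`chrono_split` is a hypothetical name, not a pandas or scikit-learn function):

```python
import pandas as pd

def chrono_split(frame, target, train_frac=0.7):
    """Split a date-ordered frame into train/test parts without shuffling."""
    cut = int(len(frame) * train_frac)   # row position of the first test row
    X, y = frame.drop(columns=target), frame[target]
    return X.iloc[:cut], X.iloc[cut:], y.iloc[:cut], y.iloc[cut:]

toy = pd.DataFrame({'visits': range(10), 'lag1': range(-1, 9)},
                   index=pd.date_range('2016-01-01', periods=10, freq='D'))

X_tr, X_te, y_tr, y_te = chrono_split(toy, 'visits')
print(len(X_tr), len(X_te))
print(X_tr.index.max() < X_te.index.min())   # training ends before testing starts
```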

In [53]:
data.iloc[[0,frac,frac+1,-1],:]

date,visits,month,day,quarter,dayofweek,weekend,lag1,lag2,lag3,lag4,lag5,lag6,lag7,bias
2016-01-08,1804425,1,8,1,4,False,1972186.0,1952324.0,1787494.0,1750373.0,1455522.0,1395136.0,1401667.0,1
2016-09-15,1729568,9,15,3,3,False,1728053.0,1814001.0,1731404.0,1464487.0,1408230.0,1633931.0,1758004.0,1
2016-09-16,1653645,9,16,3,4,False,1729568.0,1728053.0,1814001.0,1731404.0,1464487.0,1408230.0,1633931.0,1
2016-12-31,1175657,12,31,4,5,True,1397331.0,1455447.0,1399599.0,1481319.0,1358883.0,1030746.0,1088418.0,1


In [None]:
ts = data['visits']

## Modelling: Ensemble Regressors


In [55]:
reg = AdaBoostRegressor(n_estimators=5000, random_state=random_state, learning_rate=0.01)

In [56]:
reg.fit(Xtrain,ytrain)

AdaBoostRegressor(base_estimator=None, learning_rate=0.01, loss='linear',
                  n_estimators=5000, random_state=100)

In [57]:
ypreds = reg.predict(Xtest)

In [58]:
ypreds[:5]

array([1745540.32 , 1716665.386, 1660556.582, 1657822.064, 1832404.09 ])

In [59]:
r2 = r2_score(ytest, ypreds)

In [60]:
r2

-0.17162008952590346
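
A negative R² means the predictions have a larger squared error on the test set than a constant prediction equal to the test-set mean. The sketch below computes R² from its definition on made-up numbers (not values from this notebook) to show how a systematic level offset alone can push the score below zero. Whether that is what happens here is not established by this example.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only.
y_true = np.array([10.0, 12.0, 11.0, 9.0, 8.0])
y_const = np.full_like(y_true, y_true.mean())   # always predict the test mean
y_off = y_true + 3.0                            # right shape, wrong level

print(r2(y_true, y_true))    # 1.0
print(r2(y_true, y_const))   # 0.0
print(r2(y_true, y_off))     # -3.5
```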

In [44]:
ypreds.shape

(108,)

In [45]:
ytest.shape

(108,)
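
Only `AdaBoostRegressor` is fitted above, although four ensemble regressors are imported. A hedged sketch of how the four could be compared in one loop follows. It runs on synthetic data with small `n_estimators` so that it finishes quickly. The data, the settings and the `scores` dictionary are assumptions for illustration, and the printed numbers say nothing about the Wikipedia series.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              ExtraTreesRegressor, GradientBoostingRegressor)
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.RandomState(100)

# Synthetic stand-in for the feature matrix: y depends on two columns plus noise.
X = rng.rand(200, 3)
y = 5 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X[:140], X[140:], y[:140], y[140:]

models = {
    'AdaBoost': AdaBoostRegressor(n_estimators=50, random_state=100),
    'Bagging': BaggingRegressor(n_estimators=50, random_state=100),
    'ExtraTrees': ExtraTreesRegressor(n_estimators=50, random_state=100),
    'GradientBoosting': GradientBoostingRegressor(n_estimators=50, random_state=100),
}

# Fit each model on the same split and record MAE and R^2 on the held-out part.
scores = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = (mean_absolute_error(y_te, preds), r2_score(y_te, preds))

for name, (mae, r2) in scores.items():
    print(f'{name:17s} MAE={mae:.3f}  R2={r2:.3f}')
```

In the notebook, the same loop would take `Xtrain`, `ytrain`, `Xtest` and `ytest` in place of the synthetic arrays.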