<html>
<font color = green size = 6>
<b>
Time Series Forecasting of Hourly Data Using the Holt-Winters Model
</b>
</font>
</html>

<html>
<font color = blue>
A package was created to compute Holt-Winters forecasts. <br />
Install it using the following command: <br />
pip install <b> sulekha_holtwinters </b>
</font>
</html>

In [2]:
from sulekha_holtwinters import holtwinters as hw

In [3]:
hd = hw.holtwinters()
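
<html>
<font color = blue>
For orientation, here is a minimal pure-Python sketch of additive Holt-Winters (triple exponential smoothing). It is not the code of sulekha_holtwinters: the function name, the initialisation of level, trend and seasonal terms, and the return values are all assumptions made for illustration, and the package may initialise or update its components differently.
</font>
</html>

```python
def holt_winters_additive(series, alpha, beta, gamma, season_length, n_predictions):
    """Illustrative additive Holt-Winters (hypothetical helper, not the package API).

    Returns (fitted, forecasts): one-step-ahead fits for every observation,
    followed by n_predictions values beyond the end of the series.
    """
    L = season_length
    if len(series) < 2 * L:
        raise ValueError("need at least two full seasons to initialise")

    # Assumed initialisation: level = mean of the first season,
    # trend = average per-step change between the first two seasons,
    # seasonal terms = deviation of the first season from its mean.
    level = sum(series[:L]) / L
    trend = (sum(series[L:2 * L]) - sum(series[:L])) / (L * L)
    seasonals = [series[i] - level for i in range(L)]

    fitted = []
    for t, y in enumerate(series):
        s = seasonals[t % L]
        fitted.append(level + trend + s)  # one-step-ahead prediction for time t
        previous_level = level
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonals[t % L] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    forecasts = [
        level + h * trend + seasonals[(n + h - 1) % L]
        for h in range(1, n_predictions + 1)
    ]
    return fitted, forecasts


# A perfectly periodic toy series with no trend is reproduced by the model.
toy = [10, 20, 30, 40] * 5
fitted, forecasts = holt_winters_additive(toy, 0.5, 0.5, 0.5, season_length=4, n_predictions=4)
print(forecasts)
```

<html>
<font color = blue>
In this sketch alpha smooths the level, beta the trend and gamma the seasonal terms; season_length plays the role of the value 24 passed later in the notebook for hourly data.
</font>
</html>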

<html>
<font color = Purple size = 4>
<b> Set up the PySpark environment required for running the model</b>
</font>
</html>

In [4]:
#Pyspark setup
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import NumericType

In [5]:
sc = SparkContext.getOrCreate()
spark = SQLContext(sc)

<html>
<font color = Red size = 6>
<b> Feature Engineering</b>
</font>
</html>

<html>
<font color = blue>
Load the cleaned-up log file as a Spark DataFrame.
</font>
</html>

In [7]:
newlogs_df = spark.read.csv(path = "/home/sulekhadileep/Documents/newlogs.csv", header = True,inferSchema = True)

<html>
<font color = blue>
The data has one row per AvatarID per timestamp, so first count the AvatarIDs for each QueryTime.
</font>
</html>

In [8]:
df2 = newlogs_df.groupBy(['QueryTime']).count()

<html>
<font color = blue>
Strip the minutes and seconds from QueryTime so that only the date and hour remain.
</font>
</html>

In [9]:
df2 = df2.withColumn('DateTime', df2['QueryTime'].substr(1,12))

In [10]:
df2

DataFrame[QueryTime: string, count: bigint, DateTime: string]

<html>
<font color = blue>
Now sum the player counts for each hour.
</font>
</html>

In [11]:
df3 = df2.groupBy(['DateTime']).agg({"count": "sum"})

In [12]:
df3

DataFrame[DateTime: string, sum(count): bigint]
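
<html>
<font color = blue>
The two grouping steps above can be illustrated in plain Python on a handful of rows. This is only a sketch of the logic, not Spark code; the sample rows and the timestamp layout "MM/DD/YY HH:MM:SS" are assumptions, since the notebook does not show sample values. The slice [:12] mirrors substr(1,12), which is 1-based with a length of 12.
</font>
</html>

```python
from collections import Counter

# Hypothetical log rows: (QueryTime, AvatarID). The timestamp layout is assumed.
rows = [
    ("01/01/08 00:02:04", 1),
    ("01/01/08 00:02:04", 2),
    ("01/01/08 00:12:09", 1),
    ("01/01/08 01:03:00", 3),
]

# Step 1: count rows per QueryTime (the groupBy(['QueryTime']).count() step).
per_timestamp = Counter(query_time for query_time, _ in rows)

# Step 2: truncate each timestamp to its first 12 characters and sum the counts
# (the substr(1,12) and groupBy(['DateTime']).agg({"count": "sum"}) steps).
per_hour = Counter()
for query_time, n in per_timestamp.items():
    per_hour[query_time[:12]] += n

print(dict(per_hour))
```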

In [13]:
df3 = df3.withColumn('Date', df3['DateTime'].substr(1,9))

In [14]:
df3 = df3.withColumnRenamed('sum(count)','PlayersCount')

In [15]:
df3

DataFrame[DateTime: string, PlayersCount: bigint, Date: string]

<html>
<font color = blue>
Create a user-defined function that returns the weekday for each Date, since the data set will be split by day of the week before forecasting.
</font>
</html>

In [16]:
def get_weekday(date):
    import datetime
    import calendar
    month, day, year = (int(x) for x in date.split('/'))    
    weekday = datetime.date(year, month, day)
    return calendar.day_name[weekday.weekday()]

In [17]:
spark.udf.register('get_weekday', get_weekday)
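
<html>
<font color = blue>
The weekday lookup can be checked outside Spark before registering it. The sketch below is a hypothetical variant (get_weekday_strptime is not part of the notebook) that parses the date with an explicit format string; the layouts "%m/%d/%Y" and "%m/%d/%y" are assumptions about the Date column. One point to keep in mind: the int()-based parsing above passes the year through as written, so a two-digit year such as 08 would be treated as the year 8, whereas "%y" maps it to 2008.
</font>
</html>

```python
import calendar
import datetime


def get_weekday_strptime(date_text, fmt="%m/%d/%Y"):
    """Return the English weekday name for a date string (illustrative helper)."""
    # strip() tolerates a trailing space left over from substring extraction.
    parsed = datetime.datetime.strptime(date_text.strip(), fmt).date()
    return calendar.day_name[parsed.weekday()]


print(get_weekday_strptime("01/01/2008"))                 # four-digit year
print(get_weekday_strptime("01/01/08 ", fmt="%m/%d/%y"))  # two-digit year
```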

In [18]:
df3.createOrReplaceTempView("weekdays")

In [19]:
df4 = spark.sql("select DateTime, PlayersCount, get_weekday(Date) as Weekday from weekdays order by DateTime")

In [20]:
df4

DataFrame[DateTime: string, PlayersCount: bigint, Weekday: string]

In [21]:
df4.createOrReplaceTempView("completedata")

<html>
<font color = blue>
<b> Collate all data for Sundays in a dataframe df_sunday </b>
</font>
</html>

In [22]:
df_sunday = spark.sql("select DateTime, PlayersCount from completedata where Weekday = 'Sunday' order by DateTime")

<html>
<font color = blue>
<b> Collate all data for Mondays in a dataframe df_monday </b>
</font>
</html>

In [93]:
df_monday = spark.sql("select DateTime, PlayersCount from completedata where Weekday = 'Monday' order by DateTime")

<html>
<font color = blue>
<b> Repeat the same step for the remaining days of the week </b>
</font>
</html>

In [94]:
df_tuesday = spark.sql("select DateTime, PlayersCount from completedata where Weekday = 'Tuesday' order by DateTime")

In [95]:
df_wednesday = spark.sql("select DateTime, PlayersCount from completedata where Weekday = 'Wednesday' order by DateTime")

In [96]:
df_thursday = spark.sql("select DateTime, PlayersCount from completedata where Weekday = 'Thursday' order by DateTime")

In [98]:
df_friday = spark.sql("select DateTime, PlayersCount from completedata where Weekday = 'Friday' order by DateTime")

In [97]:
df_saturday = spark.sql("select DateTime, PlayersCount from completedata where Weekday = 'Saturday' order by DateTime")
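
<html>
<font color = blue>
Since the seven queries differ only in the day name, they could also be generated in a loop. The sketch below only builds the query strings so that it runs without a Spark session; QUERY_TEMPLATE and weekday_queries are illustrative names, not part of the notebook.
</font>
</html>

```python
import calendar

# Same query text as the cells above, with the day name left as a placeholder.
QUERY_TEMPLATE = (
    "select DateTime, PlayersCount from completedata "
    "where Weekday = '{day}' order by DateTime"
)

# calendar.day_name yields Monday..Sunday (English names in the default locale).
weekday_queries = {day: QUERY_TEMPLATE.format(day=day) for day in calendar.day_name}

# With a live session the DataFrames could then be built in one step, e.g.:
#   frames = {day: spark.sql(query) for day, query in weekday_queries.items()}
for day, query in weekday_queries.items():
    print(day, "->", query)
```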

<html>
<font color = blue>
<b> To avoid reprocessing the full data set of more than 40 million records, save the intermediate results as CSV and use those files for forecasting. </b>
</font>
</html>

# Write a header row so that the later reads with header = True keep the column names
df_sunday.repartition(1).write.csv('timeseries_wow_hourly_sunday.csv', header = True)

df_monday.repartition(1).write.csv('timeseries_wow_hourly_monday.csv', header = True)

df_tuesday.repartition(1).write.csv('timeseries_wow_hourly_tuesday.csv', header = True)

df_wednesday.repartition(1).write.csv('timeseries_wow_hourly_wednesday.csv', header = True)

df_thursday.repartition(1).write.csv('timeseries_wow_hourly_thursday.csv', header = True)

df_friday.repartition(1).write.csv('timeseries_wow_hourly_friday.csv', header = True)

df_saturday.repartition(1).write.csv('timeseries_wow_hourly_saturday.csv', header = True)

<html>
<font color = Red size = 6>
<b> Time series Forecasting model building </b>
</font>
</html>

<html>
<font color = blue size = 6>
<b> Sunday Forecast</b>
</font>
</html>

In [8]:
df_sunday = spark.read.csv(path = "timeseries_wow_hourly_sunday.csv", header = True,inferSchema = True)

<html>
<font color = blue>
<b> Try various values for the smoothing parameters alpha, beta and gamma</b>
</font>
</html>

Trial 1:

In [9]:
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_sunday,0.05,0.06,0.07,24,24)

<html>
<font color = blue>
<b> Calculate the Mean Absolute Percentage Error (MAPE) between actual and forecasted values</b>
</font>
</html>

In [10]:
hd.MAPE(Observed, Predictions)

36.568545344429694
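
<html>
<font color = blue>
For reference, MAPE is the mean of the absolute errors expressed as a percentage of the observed values. The function below is a minimal sketch of that definition, not the implementation behind hd.MAPE; skipping observations equal to zero is a choice made here to avoid division by zero and may differ from the package.
</font>
</html>

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent (illustrative helper)."""
    # Pair the values up and drop zero observations, which would divide by zero.
    pairs = [(o, p) for o, p in zip(observed, predicted) if o != 0]
    if not pairs:
        raise ValueError("no non-zero observations to score")
    return 100.0 * sum(abs(o - p) / abs(o) for o, p in pairs) / len(pairs)


# Each prediction is off by 10% of its observed value.
print(mape([100, 200], [110, 180]))
```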

Trial 2:

In [11]:
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_sunday,0.15,0.16,0.17,24,24)

In [12]:
hd.MAPE(Observed, Predictions)

20.34377185289498

Trial 3:

In [13]:
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_sunday,0.145,0.146,0.247,24,24)

In [14]:
hd.MAPE(Observed, Predictions)

18.875780664814123

Trial 4:

In [15]:
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_sunday,0.865,0.01,0.865,24,24)

In [16]:
hd.MAPE(Observed, Predictions)

0.7025811154520147

<html>
<font color = blue>
<b> Use the function below to search for the most suitable combination of alpha, beta and gamma</b>
</font>
</html>

In [17]:
Accuracy, Alpha, Beta, Gamma = hd.BestFit_Additive(df_sunday, interval=865, denominator=1000,L = 24,n_predictions = 24)

In [18]:
Accuracy

89.18893460158974

In [19]:
Alpha,Beta, Gamma

(0.865, 0.0, 0.865)
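
<html>
<font color = blue>
A parameter search of this kind can be pictured as an exhaustive grid search over candidate values. The sketch below is generic and makes no claim about how BestFit_Additive works internally or how it interprets interval and denominator; grid_search and toy_error are hypothetical names, and the toy error function stands in for fitting the model and returning its MAPE.
</font>
</html>

```python
from itertools import product


def grid_search(score_fn, candidates):
    """Try every (alpha, beta, gamma) combination and keep the lowest score."""
    best_params, best_score = None, float("inf")
    for alpha, beta, gamma in product(candidates, repeat=3):
        score = score_fn(alpha, beta, gamma)
        if score < best_score:
            best_params, best_score = (alpha, beta, gamma), score
    return best_params, best_score


# Candidate values 0.0, 0.1, ..., 1.0 for each smoothing parameter.
candidates = [i / 10 for i in range(11)]


def toy_error(alpha, beta, gamma):
    # Stand-in for "fit the model and return its error"; minimised at (0.9, 0.0, 0.9).
    return (alpha - 0.9) ** 2 + beta ** 2 + (gamma - 0.9) ** 2


best_params, best_score = grid_search(toy_error, candidates)
print(best_params, best_score)
```

<html>
<font color = blue>
The cost grows with the cube of the number of candidates, which is one reason a coarse grid followed by manual refinement is common in practice.
</font>
</html>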

<html>
<font color = blue>
<b> Create graphs to view the quality of Trend, Level, Seasonality, Predictions and Observations</b>
</font>
</html>

In [20]:
model1 = hd.holtwinters_additive(df_sunday,0.865,0.01,0.865,24,24)
hd.CreateGraphs(model1)

<html>
<font color = blue>
<b> Plot the last 24 predictions alongside the last 24 observations to check the quality of the prediction model</b>
</font>
</html>

In [21]:
def plot(Observed, Predictions):
    # Imports are kept local so the cell is self-contained; unused imports are dropped.
    from plotly.offline import init_notebook_mode, iplot
    from plotly import tools
    from plotly import graph_objs as go
    init_notebook_mode(connected=True)  # Plotly offline mode
    # One line trace per series
    observed = go.Scatter(y=Observed)
    predictions = go.Scatter(y=Predictions)
    # Stack the two series vertically in a 2 x 1 grid of subplots
    fig = tools.make_subplots(rows=2, cols=1, subplot_titles=("Observed[t-24]", "Predictions[t+24]"), print_grid=False)
    fig.append_trace(observed, 1, 1)
    fig.append_trace(predictions, 2, 1)
    fig['layout'].update(height=600, width=900, title='Holt Winters Plot', showlegend=False)
    return iplot(fig)

In [22]:
plot(Observed[-24:], Predictions[-24:])

<html>
<font color = blue size = 6>
<b> Monday Forecast</b>
</font>
</html>

In [23]:
df_monday = spark.read.csv(path = "timeseries_wow_hourly_monday.csv", header = True,inferSchema = True)

In [24]:
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_monday,0.865,0.01,0.865,24,24)
hd.MAPE(Observed, Predictions)

0.47700445363962657

In [25]:
hd.BestFit_Additive(df_monday, interval=865, denominator=1000,L = 24,n_predictions = 24)

(99.10889897302785, 0.865, 0.0, 0.865)

In [26]:
model1 = hd.holtwinters_additive(df_monday,0.865,0.01,0.865,24,24)
hd.CreateGraphs(model1)

<html>
<font color = blue size = 6>
<b> Tuesday Forecast</b>
</font>
</html>

In [27]:
df_tuesday = spark.read.csv(path = "timeseries_wow_hourly_tuesday.csv", header = True,inferSchema = True)
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_tuesday,0.865,0.01,0.865,24,24)
hd.MAPE(Observed, Predictions)

hd.BestFit_Additive(df_tuesday, interval=865, denominator=1000,L = 24,n_predictions = 24)

model1 = hd.holtwinters_additive(df_tuesday,0.865,0.01,0.865,24,24)
hd.CreateGraphs(model1)

<html>
<font color = blue size = 6>
<b> Wednesday Forecast</b>
</font>
</html>

In [29]:
df_wednesday = spark.read.csv(path = "timeseries_wow_hourly_wednesday.csv", header = True,inferSchema = True)
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_wednesday,0.865,0.01,0.865,24,24)
hd.MAPE(Observed, Predictions)

hd.BestFit_Additive(df_wednesday, interval=865, denominator=1000,L = 24,n_predictions = 24)

model1 = hd.holtwinters_additive(df_wednesday,0.865,0.01,0.865,24,24)
hd.CreateGraphs(model1)

<html>
<font color = blue size = 6>
<b> Thursday Forecast</b>
</font>
</html>

In [30]:
df_thursday = spark.read.csv(path = "timeseries_wow_hourly_thursday.csv", header = True,inferSchema = True)
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_thursday,0.865,0.01,0.865,24,24)
hd.MAPE(Observed, Predictions)

hd.BestFit_Additive(df_thursday, interval=865, denominator=1000,L = 24,n_predictions = 24)

model1 = hd.holtwinters_additive(df_thursday,0.865,0.01,0.865,24,24)
hd.CreateGraphs(model1)

<html>
<font color = blue size = 6>
<b> Friday Forecast</b>
</font>
</html>

In [31]:
df_friday = spark.read.csv(path = "timeseries_wow_hourly_friday.csv", header = True,inferSchema = True)
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_friday,0.865,0.01,0.865,24,24)
hd.MAPE(Observed, Predictions)

hd.BestFit_Additive(df_friday, interval=865, denominator=1000,L = 24,n_predictions = 24)

model1 = hd.holtwinters_additive(df_friday,0.865,0.01,0.865,24,24)
hd.CreateGraphs(model1)

<html>
<font color = blue size = 6>
<b> Saturday Forecast</b>
</font>
</html>

In [32]:
df_saturday = spark.read.csv(path = "timeseries_wow_hourly_saturday.csv", header = True,inferSchema = True)
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(df_saturday,0.865,0.01,0.865,24,24)
hd.MAPE(Observed, Predictions)

hd.BestFit_Additive(df_saturday, interval=865, denominator=1000,L = 24,n_predictions = 24)

model1 = hd.holtwinters_additive(df_saturday,0.865,0.01,0.865,24,24)
hd.CreateGraphs(model1)

<html>
<font color = blue>
<b> Verify the RMSE of n periods predicted into the future </b>
</font>
</html>

<html>
<font color = blue>
To perform this check, the data for one particular day (24 time points) is held out of the dataset as a test set, and the Holt-Winters model is run on the remaining data with a future prediction period of n = 24 time points.
</font>
</html>
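
<html>
<font color = blue>
The hold-out idea can be sketched on a plain Python list. How the train and test CSV files used below were actually produced is not shown in this notebook; split_last_day is a hypothetical helper that assumes the held-out day is the final 24 points of an ordered hourly series.
</font>
</html>

```python
def split_last_day(series, horizon=24):
    """Split an ordered series into (train, test), holding out the last `horizon` points."""
    if len(series) <= horizon:
        raise ValueError("series must be longer than the hold-out horizon")
    return series[:-horizon], series[-horizon:]


# Three days of made-up hourly values, just to show the shapes.
hourly = list(range(72))
train_part, test_part = split_last_day(hourly)
print(len(train_part), len(test_part))
```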

In [49]:
train = spark.read.csv(path = "timeseries_wow_hourly_train.csv", header = True,inferSchema = True)
test = spark.read.csv(path = "timeseries_wow_hourly_test.csv", header = True,inferSchema = True)

In [50]:
Observed, Predictions, Level, Trend, Seasonality = hd.holtwinters_additive(train,0.9,0.009,0.865,24,24)

In [51]:
observed = [p.PlayersCount for p in test.select("PlayersCount").collect()]

In [52]:
preds = Predictions[-24:]

<html>
<font color = blue>
<b> Forecast RMSE (fit on the training data)</b>
</font>
</html>

In [57]:
hd.Accuracy(Observed, Predictions)

99.34744290252247

In [53]:
hd.MAPE(Observed, Predictions)

0.6525570974775063

In [54]:
hd.RMSE(Observed,Predictions)

1087.2973261090815
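
<html>
<font color = blue>
RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same unit as PlayersCount. The function below is a minimal sketch of that definition and is not the implementation behind hd.RMSE. As a side note, the Accuracy and MAPE outputs above add up to 100, which suggests that Accuracy is reported as 100 minus MAPE; this is inferred from the printed values, not from the package documentation.
</font>
</html>

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences (illustrative helper)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 3 and 4 give sqrt((9 + 16) / 2).
print(rmse([0, 0], [3, 4]))
```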

<html>
<font color = blue>
<b> Future Value RMSE and MAPE (held-out day)</b>
</font>
</html>

In [55]:
hd.RMSE(observed,preds)

2974.269556489125

In [56]:
hd.MAPE(observed,preds)

51.51014677728935

<html>
<font color = red size = 4>
<b> Insights:</b>
</font>
</html>

<html>
<font color = blue>
These RMSE and MAPE values are much lower than the values from the ARIMA and LSTM (TensorFlow) models that were attempted. <br />
Hence the Holt-Winters triple exponential smoothing additive method seems to be more dependable for time series forecasting on the WoW dataset.
</font>
</html>

<html>
<font color = blue>
The accuracy of this model is highly dependent on the values chosen for the alpha, beta and gamma parameters.
Even though there is a function available to calculate these values, choosing the optimal parameters still relies largely on trial and error. <br />
Also, this model is better suited to forecasting short-term time points. If the model is used to predict longer periods into the future, accuracy would be much lower and the values might just show a linear trend, since historical data for these kinds of games becomes stale as time passes.
</font>
</html>
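
<html>
<font color = blue>
The remark about long horizons follows from the form of the additive forecast: once the last observation has been processed, the level, trend and seasonal terms are frozen, and a forecast h steps ahead is level + h * trend + the matching seasonal term. The sketch below uses made-up component values (not taken from this dataset) to show that forecasts one full season apart differ by exactly season length times trend, i.e. a straight line with a repeating seasonal pattern on top.
</font>
</html>

```python
def forecast_ahead(level, trend, seasonals, n_observed, h):
    """Additive Holt-Winters forecast h steps past the last of n_observed points."""
    season_length = len(seasonals)
    return level + h * trend + seasonals[(n_observed + h - 1) % season_length]


# Made-up final components for a series with a season of 4 points.
level, trend = 100.0, 2.0
seasonals = [-5.0, 0.0, 5.0, 0.0]
n_observed = 8

forecasts = [forecast_ahead(level, trend, seasonals, n_observed, h) for h in range(1, 13)]

# Difference between forecasts exactly one season apart: always 4 * trend.
season_to_season = [forecasts[i + 4] - forecasts[i] for i in range(8)]
print(forecasts)
print(season_to_season)
```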