# Forecasting for July

Technical Challenge for Data Science Candidates

You want to forecast the “Outside Temperature” for the first 9 days of the next month.

Assume that:

  - The average temperature for each day of July is constant and equal to 25 degrees;

  - For the 1st of July, the pattern of temperatures across the day, relative to that day's average temperature, is similar to the one found on the 1st of June; for the 2nd of July it is similar to the pattern on the 2nd of June, etc.
  
Produce a “.txt” file with your forecast for July (from 1st July to 9th July) containing the sample value for each time, e.g. dd/mm/yyyy, Time, Outside Temperature.

## Implementation

The algorithm I'll use has three steps:

  - Calculate the average temperature for each day of June.

  - For each reading on that day, calculate its offset from the day's average.

  - Add those offsets to the July average of 25 degrees, matching each day of June to the same day of July.
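The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's final code: it assumes a frame with the `dttm0` and `Outside Temperature` columns seen in the data, while the function name `additive_forecast`, the output columns `Forecast` and `dttm1`, and the tiny sample frame are all made up for the example.

```python
import numpy as np
import pandas as pd

JULY_MEAN = 25.0  # constant daily average for July, as stated in the brief


def additive_forecast(june: pd.DataFrame, july_mean: float = JULY_MEAN) -> pd.DataFrame:
    """Shift each June day's profile so that its daily mean equals july_mean."""
    out = june.copy()
    day = out["dttm0"].dt.normalize()  # midnight of each reading's day
    # Daily mean, broadcast back onto every reading of that day
    daily_mean = out.groupby(day)["Outside Temperature"].transform("mean")
    offset = out["Outside Temperature"] - daily_mean
    out["Forecast"] = july_mean + offset
    # Map each June timestamp to the same day and time in July
    out["dttm1"] = out["dttm0"] + pd.DateOffset(months=1)
    return out


# Hypothetical sample: two days with three readings each
june = pd.DataFrame({
    "dttm0": pd.to_datetime([
        "2006-06-01 00:00", "2006-06-01 12:00", "2006-06-01 18:00",
        "2006-06-02 00:00", "2006-06-02 12:00", "2006-06-02 18:00",
    ]),
    "Outside Temperature": [10.0, 16.0, 13.0, 11.0, 19.0, 15.0],
})
july = additive_forecast(june)
print(july[["dttm1", "Forecast"]])
```

Using `transform("mean")` keeps the result aligned with the original rows, so the offsets can be computed without a merge.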

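Note that adding the raw offsets isn't a particularly good predictor because it doesn't apply like for like: the offsets are measured around the June daily means, which differ from the July mean of 25 degrees. It's probably better to use a normalization method.

One option is to standardize the residuals for each day in June: (X - mu) / std, where mu is the day's mean and std is its sample standard deviation. This gives the standardized residuals (z-scores) rather than the absolute residuals X - mu, so they are unitless and have to be multiplied by a standard deviation before being added to a temperature.

But better still is to calculate the percentage residuals from the mean on each day, r = (X - mu) / mu, and apply them to the July mean as (1 + r) * 25.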
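To make the difference between these options concrete, here is a small sketch on one made-up day of readings. The helper names `additive`, `standardized` and `percentage` are illustrative only, and the choice of target standard deviation in `standardized` is an assumption, since the brief does not specify the spread for July.

```python
import numpy as np


def additive(x, target_mean):
    """Add the absolute residuals (x - mean) to the target mean."""
    return target_mean + (x - x.mean())


def standardized(x, target_mean, target_std=None):
    """Rescale z-scores to a target mean and standard deviation."""
    mu, std = x.mean(), x.std(ddof=1)  # ddof=1 gives the sample standard deviation
    z = (x - mu) / std
    if target_std is None:
        target_std = std  # assumption: keep the June spread
    return target_mean + z * target_std


def percentage(x, target_mean):
    """Apply the percentage residuals r = (x - mean) / mean to the target mean."""
    r = (x - x.mean()) / x.mean()
    return target_mean * (1 + r)


june_day = np.array([12.0, 14.0, 16.0, 18.0, 20.0])  # hypothetical readings, mean 16

print(additive(june_day, 25.0))      # spread unchanged
print(standardized(june_day, 25.0))  # same as additive when the June spread is kept
print(percentage(june_day, 25.0))    # spread scaled by 25 / 16
```

All three keep the daily mean at 25. The standardized version only differs from the additive one if a different target standard deviation is supplied, whereas the percentage version widens the daily swing in proportion to the higher mean.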

In [2]:
import numpy as np
import pandas as pd

import matplotlib
from cycler import cycler
import matplotlib.pyplot as plt

pd.__version__

'0.24.2'

In [3]:
# Display the result of every expression in a cell, not just the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
df1 = pd.read_pickle("200606.pkl")

In [5]:
df1.info()
df1.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4319 entries, 90 to 4408
Data columns (total 20 columns):
Date                      4319 non-null object
Time                      4319 non-null object
Temp Humidity Index       4319 non-null float64
Outside Temperature       4319 non-null float64
WindChill                 4319 non-null float64
Hi Temperature            4319 non-null float64
Low Temperature           4319 non-null float64
Outside Humidity          4319 non-null int64
DewPoint                  4319 non-null float64
WindSpeed                 4319 non-null int64
Hi                        4319 non-null int64
Wind Direction            4319 non-null object
Rain                      4319 non-null float64
Barometer                 4319 non-null float64
Inside  Temperature       4319 non-null float64
Inside  Humidity          4319 non-null int64
ArchivePeriod             4319 non-null int64
dttm0                     4319 non-null datetime64[ns]
m0                        4319 non

Unnamed: 0,Date,Time,Temp Humidity Index,Outside Temperature,WindChill,Hi Temperature,Low Temperature,Outside Humidity,DewPoint,WindSpeed,Hi,Wind Direction,Rain,Barometer,Inside Temperature,Inside Humidity,ArchivePeriod,dttm0,m0,dy0
90,01/06/2006,00:00,9.6,9.6,9.6,9.6,9.5,81,6.4,1,10,WNW,0.0,1014.4,21.1,40,10,2006-06-01 00:00:00,6,1
91,01/06/2006,00:10,9.5,9.5,9.5,9.6,9.5,81,6.4,1,5,SW,0.0,1014.4,21.1,40,10,2006-06-01 00:10:00,6,1
92,01/06/2006,00:20,9.5,9.5,9.5,9.6,9.5,81,6.4,1,7,WSW,0.0,1014.3,21.1,40,10,2006-06-01 00:20:00,6,1
93,01/06/2006,00:30,9.5,9.5,9.5,9.6,9.5,82,6.6,1,6,W,0.0,1014.2,21.1,40,10,2006-06-01 00:30:00,6,1
94,01/06/2006,00:40,9.5,9.5,9.5,9.6,9.5,83,6.8,1,6,SW,0.0,1014.1,21.0,40,10,2006-06-01 00:40:00,6,1


In [6]:
# Filter down to the first 9 days of June
ndays = 9
cut0 = df1.dttm0.min().normalize() + pd.DateOffset(days=ndays)
df2 = df1[df1['dttm0'] < cut0]

In [7]:
df2.columns

Index(['Date', 'Time', 'Temp Humidity Index   ', 'Outside Temperature',
       'WindChill', 'Hi Temperature', 'Low Temperature', 'Outside Humidity',
       'DewPoint', 'WindSpeed', 'Hi', 'Wind Direction', 'Rain', 'Barometer',
       'Inside  Temperature', 'Inside  Humidity', 'ArchivePeriod', 'dttm0',
       'm0', 'dy0'],
      dtype='object')

In [10]:
x = np.random.rand(100)*10
x

array([9.88991433, 9.67573932, 5.12870002, 7.57879188, 0.94915171,
       5.25852991, 8.23440588, 9.06819005, 7.8795339 , 2.79517412,
       4.31399461, 5.71481754, 1.79826591, 0.97849421, 2.38066395,
       7.76452657, 3.86763703, 0.29925991, 0.53135686, 5.40243638,
       8.26144198, 3.01103188, 2.14945676, 8.94202568, 5.02909971,
       6.89577487, 6.99893521, 1.81841238, 9.4207588 , 2.52834331,
       5.24626561, 2.55038515, 3.78944686, 1.65613328, 2.78729666,
       1.95563211, 4.80682303, 1.55662081, 4.16831928, 2.95410085,
       6.88639991, 0.55221514, 7.05243814, 0.02121903, 2.66835261,
       9.03662973, 4.86011783, 6.66314005, 4.80101528, 9.84079286,
       3.22220836, 3.40450419, 3.67619031, 4.48996775, 9.32837698,
       1.60127375, 4.05251874, 1.97021362, 6.69958676, 0.47953719,
       5.76255757, 0.26878649, 6.36322813, 2.13672672, 0.70468034,
       0.46183762, 2.37928596, 5.97822917, 8.28170689, 4.69537875,
       9.77272765, 3.28676836, 1.36293571, 2.35062641, 5.54104

In [11]:
np.linalg.norm(x)

55.82436757697846

In [13]:
norm1 = x / np.linalg.norm(x)
norm1

array([0.17716124, 0.17332466, 0.09187207, 0.13576136, 0.01700246,
       0.09419775, 0.14750558, 0.16244143, 0.14114865, 0.05007086,
       0.07727798, 0.10237138, 0.03221292, 0.01752808, 0.04264561,
       0.13908848, 0.06928224, 0.00536074, 0.00951837, 0.09677559,
       0.14798989, 0.05393759, 0.03850392, 0.16018141, 0.09008789,
       0.12352625, 0.1253742 , 0.03257381, 0.16875711, 0.04529103,
       0.09397806, 0.04568588, 0.06788159, 0.02966685, 0.04992975,
       0.03503187, 0.08610618, 0.02788425, 0.07466845, 0.05291777,
       0.12335831, 0.00989201, 0.12633261, 0.0003801 , 0.04779907,
       0.16187608, 0.08706087, 0.11935899, 0.08600214, 0.17628131,
       0.05772046, 0.06098599, 0.06585279, 0.08043025, 0.16710224,
       0.02868414, 0.07259408, 0.03529308, 0.12001187, 0.00859011,
       0.10322656, 0.00481486, 0.11398657, 0.03827588, 0.01262317,
       0.00827305, 0.04262092, 0.10708996, 0.1483529 , 0.08410984,
       0.17506204, 0.05887695, 0.02441471, 0.04210753, 0.09925

In [28]:
x0 = np.random.rand(10) - 0.5
x0

array([ 0.37056704,  0.41264878, -0.45848921,  0.23707945, -0.04726233,
        0.44727906,  0.2164848 , -0.28311769,  0.12031474,  0.35139535])

In [29]:
x = 10 * x0
x

array([ 3.70567036,  4.12648785, -4.58489214,  2.37079454, -0.47262327,
        4.47279057,  2.164848  , -2.83117695,  1.20314735,  3.51395348])

In [30]:
np.linalg.norm(x)
x1 = x / np.linalg.norm(x)
x1

10.204341627482146

array([ 0.36314644,  0.40438551, -0.44930798,  0.23233194, -0.0463159 ,
        0.43832231,  0.2121497 , -0.27744827,  0.11790544,  0.34435867])

In [25]:
x1 * 10

array([ 1.98142443,  4.35314228,  2.98465367,  5.30803855,  1.40755114,
       -3.22895542, -4.64535262, -0.90341954,  1.39819572,  1.81187924])

In [33]:
np.mean(x)

1.3668999782562579

In [45]:
x0 = np.random.rand(10)
x0 = np.sort(x0)
x0

array([0.00733998, 0.04458752, 0.16431504, 0.24951028, 0.33313448,
       0.66088623, 0.75798901, 0.79348466, 0.81186477, 0.88028389])

In [49]:
x1 = x0 * 100
x1

array([ 0.73399787,  4.45875171, 16.43150412, 24.95102804, 33.31344848,
       66.08862296, 75.79890085, 79.34846578, 81.18647728, 88.02838916])

In [50]:
xbar = np.mean(x1)
xbar

47.03395862404958

In [57]:
r1 = (x1 - xbar)/xbar
r1

array([-0.9843943 , -0.90520144, -0.65064595, -0.46951035, -0.29171498,
        0.40512568,  0.61157817,  0.6870463 ,  0.72612469,  0.87159218])

In [59]:
(1+r1 )* xbar

array([ 0.73399787,  4.45875171, 16.43150412, 24.95102804, 33.31344848,
       66.08862296, 75.79890085, 79.34846578, 81.18647728, 88.02838916])

In [61]:
x0 + 10

array([10.00733998, 10.04458752, 10.16431504, 10.24951028, 10.33313448,
       10.66088623, 10.75798901, 10.79348466, 10.81186477, 10.88028389])

In [67]:
x1 * (1.15)

array([  0.84409755,   5.12756447,  18.89622974,  28.69368225,
        38.31046575,  76.0019164 ,  87.16873598,  91.25073565,
        93.36444887, 101.23264753])

In [73]:
np.mean(x1 * 1.15) * (1+r1) - x1

array([ 0.11009968,  0.66881276,  2.46472562,  3.74265421,  4.99701727,
        9.91329344, 11.36983513, 11.90226987, 12.17797159, 13.20425837])

In [74]:
x1

array([ 0.73399787,  4.45875171, 16.43150412, 24.95102804, 33.31344848,
       66.08862296, 75.79890085, 79.34846578, 81.18647728, 88.02838916])

In [75]:
( x1 - np.min(x1) ) / (np.max(x1) - np.min(x1))

array([0.        , 0.04266888, 0.17982262, 0.27741794, 0.37321356,
       0.74866923, 0.85990522, 0.90056723, 0.92162255, 1.        ])

In [108]:
np.var([1,2,3],ddof=1)
np.var([1,2,3])

np.std([1,2,3],ddof=1)
np.std([1,2,3])

1.0

0.6666666666666666

1.0

0.816496580927726

In [146]:
# Like a day in June.
mu, sigma = 16, 0.4  # mean and standard deviation
s0 = np.random.normal(mu, sigma, 24)
s0 = np.sort(s0)  # sort ascending
s0

array([15.24719949, 15.37102981, 15.38688686, 15.49529736, 15.54827515,
       15.6542128 , 15.72240757, 15.77599948, 15.85958354, 15.89364488,
       15.89941315, 15.97626295, 16.07604057, 16.09231979, 16.11318449,
       16.11379807, 16.1294378 , 16.1351105 , 16.22206166, 16.34535447,
       16.35392   , 16.53795184, 16.55596885, 16.60389196])

In [147]:
# Sample statistics
mu0 = np.mean(s0)
std0 = np.std(s0, ddof=1)
(mu0, std0)
r0 = (s0 - mu0)/std0
r0

(15.962885543369504, 0.3836068309359353)

array([-1.86567598, -1.54287068, -1.50153396, -1.21892559, -1.08082121,
       -0.80465915, -0.62688658, -0.48718126, -0.26929136, -0.18049905,
       -0.16546211,  0.03487269,  0.29497656,  0.33741381,  0.39180468,
        0.39340416,  0.43417439,  0.44896218,  0.67562958,  0.99703367,
        1.0193626 ,  1.49910339,  1.54607078,  1.67099845])

In [148]:
# Like a day in July
mu1 = 22
s1=np.ones(s0.size) * mu1
s1

array([22., 22., 22., 22., 22., 22., 22., 22., 22., 22., 22., 22., 22.,
       22., 22., 22., 22., 22., 22., 22., 22., 22., 22., 22.])

In [149]:
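# Add the standardized June residuals (z-scores) to the flat July profile.
# z-scores have unit standard deviation, so s1x ends up with std 1 rather than std0.
s1x = s1 + r0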
s1x

array([20.13432402, 20.45712932, 20.49846604, 20.78107441, 20.91917879,
       21.19534085, 21.37311342, 21.51281874, 21.73070864, 21.81950095,
       21.83453789, 22.03487269, 22.29497656, 22.33741381, 22.39180468,
       22.39340416, 22.43417439, 22.44896218, 22.67562958, 22.99703367,
       23.0193626 , 23.49910339, 23.54607078, 23.67099845])

In [150]:
mu1 = np.mean(s1x)
std1 = np.std(s1x, ddof=1)
(mu1, std1)
r1x = (s1x - mu1)/std1
r1x

(22.0, 0.9999999999999999)

array([-1.86567598, -1.54287068, -1.50153396, -1.21892559, -1.08082121,
       -0.80465915, -0.62688658, -0.48718126, -0.26929136, -0.18049905,
       -0.16546211,  0.03487269,  0.29497656,  0.33741381,  0.39180468,
        0.39340416,  0.43417439,  0.44896218,  0.67562958,  0.99703367,
        1.0193626 ,  1.49910339,  1.54607078,  1.67099845])
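
The last step in the brief is to write the forecast to a “.txt” file with the columns dd/mm/yyyy, Time, Outside Temperature. A minimal sketch of that step is below. It assumes a forecast frame with hypothetical columns `dttm1` (the July timestamp) and `Forecast`; the function name `write_forecast`, the file name `forecast_july.txt`, the comma separator and the rounding to one decimal place are all choices made for this example, not requirements from the brief.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_forecast(forecast: pd.DataFrame, path) -> None:
    """Write Date, Time and Outside Temperature columns as comma-separated text."""
    out = pd.DataFrame({
        "Date": forecast["dttm1"].dt.strftime("%d/%m/%Y"),
        "Time": forecast["dttm1"].dt.strftime("%H:%M"),
        "Outside Temperature": forecast["Forecast"].round(1),
    })
    out.to_csv(path, index=False)


# Hypothetical two-row forecast, just to exercise the writer
forecast = pd.DataFrame({
    "dttm1": pd.to_datetime(["2006-07-01 00:00", "2006-07-01 00:10"]),
    "Forecast": [22.04, 21.96],
})
path = Path(tempfile.gettempdir()) / "forecast_july.txt"
write_forecast(forecast, path)
print(path.read_text())
```

Formatting the date and time with `strftime` keeps the output in the same dd/mm/yyyy and HH:MM layout as the `Date` and `Time` columns of the June data.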