# Correlation

Correlation is a measure of the relationship between two entities or variables.  
Two entities/variables are said to have a positive correlation if they move in the same direction, i.e. both either:  
   1) increase with time  
   2) decrease with time  
The brightness of a lightbulb is positively correlated with the amount of power supplied to the lightbulb.  
Two entities/variables are said to have a negative correlation if they move in opposite directions with time.  
The number of games played on a phone is negatively correlated with battery life.  

A feature of correlation is direction.  
If both variables move in the same direction (both increasing or both decreasing), it is referred to as positive correlation.  
If one variable increases while the other decreases, it is referred to as negative correlation.  
If the variables do not appear to have a discernible pattern (scattered), they are said to have no correlation.
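
As a rough illustration of direction, the sketch below builds two small made-up datasets (the numbers are invented for illustration and are not taken from the call-volume data) and checks the sign of the coefficient returned by the pandas `Series.corr` method.

```python
import pandas as pd

# Invented illustrative numbers, not real measurements.
bulb = pd.DataFrame({
    "power_watts": [10, 20, 30, 40, 50],
    "brightness": [110, 190, 310, 405, 500],
})
phone = pd.DataFrame({
    "games_played": [1, 2, 3, 4, 5],
    "battery_percent": [90, 82, 70, 61, 48],
})

# Series.corr returns the Pearson correlation coefficient by default.
r_bulb = bulb["power_watts"].corr(bulb["brightness"])
r_phone = phone["games_played"].corr(phone["battery_percent"])

print("bulb:", "positive" if r_bulb > 0 else "negative")
print("phone:", "positive" if r_phone > 0 else "negative")
```

The sign alone gives the direction; how strong the relationship is depends on the magnitude of the coefficient.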


In [None]:
import pandas as pd
import matplotlib.pyplot as pyplot
# from google.colab import files

!ls
# !rm "multichoice_call_volumes.csv"
# uploaded = files.upload()

In [None]:
# This dataset represents the number of complaints a company receives between
# the hours of 7 a.m. and 10 p.m. from 2015-04-01 to 2018-02-18 for
# two products they offer.
data = pd.read_csv("multichoice_call_volumes.csv")
data.tail()


In [None]:
data.dtypes

Convert the Date column to a date object

In [None]:
data.Date = pd.to_datetime(data.Date)
data.Time = data.Time.astype(str) +  ":00"
data.dtypes

In [None]:
data.head()
# data.Time.map(type)

## Correlation Direction
The correlation between two variables is measured by a correlation coefficient, r.  
The correlation coefficient takes on values between -1 and +1, i.e. -1 <= r <= 1.  

Both the sign and the magnitude of the correlation coefficient are important.

A positive correlation implies 0 < r <= 1  
A negative correlation implies -1 <= r < 0  
No correlation implies r = 0
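
To make the coefficient less of a black box, here is a minimal sketch that computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and compares it with pandas. The helper name `pearson_r` and the sample values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Invented example values.
a = pd.Series([3, 5, 4, 8, 9, 12])
b = pd.Series([2, 4, 5, 7, 8, 11])

r_manual = pearson_r(a, b)
r_pandas = a.corr(b)  # Pearson is the default method
print(r_manual, r_pandas)
```

Because the numerator can never exceed the denominator in absolute value, the result always falls between -1 and +1.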

To find the correlation between the two products we use the pandas method `Series.corr`, which computes the Pearson correlation coefficient by default.


In [None]:
data['product_1'].corr(data['product_2'])

From the correlation coefficient value, we can say product_1 and product_2 have a positive correlation.

## Correlation Strength
The closer the absolute value of the correlation coefficient is to 1, the stronger the correlation, regardless of the sign.  

A perfect correlation implies r = ±1.  
A high or strong correlation implies 0.8 < |r| < 1  
A medium correlation implies 0.5 < |r| <= 0.8  
A low correlation implies |r| <= 0.5
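
The bands above can be turned into a small labelling helper. This is only a sketch: the function names are hypothetical, the cut-offs at 0.8 and 0.5 follow the bands listed above, and treating everything up to 0.5 as low is an assumption of this example.

```python
import math

def describe_strength(r):
    """Label a correlation coefficient by its magnitude (cut-offs are a convention)."""
    magnitude = abs(r)
    if math.isclose(magnitude, 1.0):
        return "perfect"
    if magnitude > 0.8:
        return "strong"
    if magnitude > 0.5:
        return "medium"
    return "low"

def describe(r):
    """Combine direction (sign) and strength (magnitude) into one label."""
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{describe_strength(r)} {direction}"

for r in (1.0, -0.93, 0.62, -0.2):
    print(r, "->", describe(r))
```

Such cut-offs are rules of thumb; what counts as strong depends on the field and on how many observations the coefficient is based on.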


In [None]:
data['product_1'][1:6].corr(data['product_2'][1:6])

The above correlation coefficient indicates a high positive correlation between product_1 and product_2

In [None]:
data['product_1'][6:12].corr(data['product_2'][6:12])


The above correlation coefficient indicates a medium negative correlation between product_1 and product_2

# Autocorrelation



In [None]:
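from pandas import Series

# Note: data_clean is an alias of data (not a copy), so data['Date'] is updated too.
data_clean = data
# Combine the calendar date and the time of day into one timestamp per row.
data_clean['Date'] = pd.to_datetime(
    data['Date'].dt.strftime('%Y-%m-%d') + ' ' + data['Time'])

# Build a time-indexed pandas Series for product_1.
ts = Series(data['product_1'].values, index=data.Date)
ts[1:1000].plot()
pyplot.show()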

In [None]:
# create a series object from the second product (product_2) and plot the values

## Quick Check for Autocorrelation

There is a quick, visual check that we can do to see if there is an autocorrelation in our time series dataset.  
Pandas provides a built-in plot to do exactly this, called the lag_plot() function.

In [None]:
from pandas.plotting import lag_plot
lag_plot(ts)
pyplot.show()

We can see a large ball of observations along a diagonal line of the plot.  
It shows a relationship or some correlation.
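
What the lag plot shows can also be checked numerically. The sketch below uses a synthetic series (a 16-step cycle plus noise, invented for illustration) rather than the call-volume data, pairs each observation with the next one, and confirms that the correlation of those pairs matches the lag-1 value from `Series.autocorr`.

```python
import numpy as np
import pandas as pd

# Synthetic series for illustration: a repeating 16-step cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(320)
s = pd.Series(50 + 20 * np.sin(2 * np.pi * t / 16) + rng.normal(0, 2, size=t.size))

# A lag plot pairs each observation y(t) with the next one, y(t+1).
pairs = pd.DataFrame({
    "y(t)": s.iloc[:-1].to_numpy(),
    "y(t+1)": s.iloc[1:].to_numpy(),
})

# The correlation of those pairs is the lag-1 autocorrelation.
r_pairs = pairs["y(t)"].corr(pairs["y(t+1)"])
r_autocorr = s.autocorr(lag=1)
print(r_pairs, r_autocorr)
```

A tight diagonal cloud in the lag plot corresponds to a coefficient close to +1, while a shapeless cloud corresponds to a coefficient near 0.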

## Autocorrelation Plots

We can plot the correlation coefficient for each lag variable.  
This can very quickly give an idea of which lag variables may be good candidates for use in a predictive model and how the relationship between the observation and its historic values changes over time.  
Pandas provides a built-in plot called the autocorrelation_plot() function.  
The plot shows the lag number along the x-axis and the correlation coefficient, between -1 and 1, on the y-axis. It also includes solid and dashed horizontal lines that indicate the 95% and 99% confidence bands for the correlation values. Correlation values outside these bands are statistically significant, which provides a threshold or cutoff for selecting more relevant lag values.
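
The following sketch shows one common way to estimate the autocorrelation at a given lag and to compare it against approximate confidence bands of ±z/sqrt(n). It uses a synthetic noise-free series with a 24-step cycle; the helper `acf_at_lag` is hypothetical, and the exact estimator used internally by a plotting library may differ in detail.

```python
import numpy as np

def acf_at_lag(x, lag):
    """Sample autocorrelation at one lag, using the overall mean and variance."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.mean()
    c0 = ((x - mean) ** 2).sum() / n
    ck = ((x[: n - lag] - mean) * (x[lag:] - mean)).sum() / n
    return ck / c0

# Synthetic, noise-free series with a 24-step cycle (for illustration only).
t = np.arange(240)
x = np.sin(2 * np.pi * t / 24)

# Approximate bands for a series with no autocorrelation: +/- z / sqrt(n).
band_95 = 1.96 / np.sqrt(x.size)
band_99 = 2.576 / np.sqrt(x.size)

for lag in (12, 24):
    r = acf_at_lag(x, lag)
    verdict = "outside" if abs(r) > band_95 else "inside"
    print(lag, round(r, 3), verdict, "the 95% band")
```

Half a cycle apart the values move in opposite directions (negative autocorrelation), and a full cycle apart they move together (positive autocorrelation).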

In [None]:
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(ts[1:1000])
pyplot.show()

Running the example shows the correlation swinging between positive and negative as the lag increases, which reflects the repeating cycles in the call volume values.

In [None]:
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(ts, lags=31)
pyplot.show()

In the above example, we limit the lag variables evaluated to 31 for readability.

To calculate the autocorrelation of the data at different lags, pandas provides the autocorr method on the Series object, which accepts an integer parameter for the lag.  
In the code below we find the autocorrelation from lag=1 to lag=100.  
In general terms, the function that maps each lag to its autocorrelation is referred to as the autocorrelation function (ACF).

In [None]:
# Autocorrelation of product_1 for lag=1 through lag=100.
auto_correlation = [ts.autocorr(lag) for lag in range(1, 101)]
auto_correlation[:10]  # show the first 10 values; remove the slice to see all



In [None]:
# find the autocorrelation for the second product (product_2)
# auto_correlation_2


To graph it we will use the autocorrelation_plot function from pandas. The function accepts a pandas series and automatically calculates the autocorrelation at varying time lags.

# Autoregression Model
An autoregression model is a linear regression model that uses lagged variables as input variables.  
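
To illustrate the idea that autoregression is ordinary linear regression on lagged values, here is a minimal sketch that fits the coefficients with NumPy least squares. The helper `fit_ar` and the synthetic series are assumptions made for this example; it is not how any particular library implements its model.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of y[t] = b0 + b1*y[t-1] + ... + bp*y[t-p].

    Returns the array [b0, b1, ..., bp].
    """
    y = np.asarray(y, dtype=float)
    # Each row: intercept term, then the p previous values, most recent first.
    rows = [np.concatenate(([1.0], y[t - p:t][::-1])) for t in range(p, len(y))]
    X = np.array(rows)
    target = y[p:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

# Synthetic, noise-free series generated by y[t] = 2 + 0.5 * y[t-1].
y = [0.0]
for _ in range(19):
    y.append(2.0 + 0.5 * y[-1])

coef = fit_ar(y, p=1)
print(coef)  # intercept first, then the lag-1 coefficient
```

Since the series was generated without noise, the fit recovers the generating intercept and lag coefficient; with real data the coefficients are only estimates.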

The statsmodels library provides an autoregression model that automatically selects an appropriate lag value using statistical tests and trains a linear regression model. It is provided in the AR class. Note that AR was deprecated in later statsmodels releases in favour of AutoReg and is not available in recent versions, so the code below assumes an older statsmodels installation.  

We can use this model by first creating the model with AR() and then calling fit() to train it on our dataset. This returns an ARResults object.  

Once fit, we can use the model to make a prediction by calling the predict() function for a number of observations in the future.  

In [None]:
from statsmodels.tsa.ar_model import AR
from sklearn.metrics import mean_squared_error

# split dataset to get only the values
ts_values = ts.values

train, test = ts_values[1:len(ts_values)-16], ts_values[len(ts_values)-16:]

# train autoregression
model = AR(train)
model_fit = model.fit()
print('Lag: %s' % model_fit.k_ar)
print('Coefficients: %s' % model_fit.params)

predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
	print('predicted=%f, expected=%f' % (predictions[i], test[i]))
error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)
# plot results
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

Running the example first prints the chosen optimal lag and the list of coefficients in the trained linear regression model.  

We can see that a 43-lag model was chosen and trained.  

The 16-hour forecast is then printed and the mean squared error of the forecast is summarized.  

A plot of the expected (blue) vs the predicted values (red) is made.

The forecast looks reasonable (off by about 5 callers each hour), with a big deviation at hour 3 (9 a.m.).
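
For reference, the mean squared error is just the average of the squared differences between expected and predicted values. The sketch below computes it by hand on invented numbers, which is the same quantity `mean_squared_error` reports.

```python
import numpy as np

# Invented values for illustration.
expected = np.array([30.0, 42.0, 55.0, 47.0])
predicted = np.array([28.0, 45.0, 50.0, 47.0])

errors = expected - predicted   # 2, -3, 5, 0
mse = np.mean(errors ** 2)      # (4 + 9 + 25 + 0) / 4
rmse = np.sqrt(mse)             # back in the original units (calls)
print('MSE: %.3f, RMSE: %.3f' % (mse, rmse))
```

The square root of the MSE is often easier to interpret because it is expressed in the same units as the series itself.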

### Model Updates

The statsmodels API does not make it easy to update the model as new observations become available.  

One way would be to re-train the AR model each day as new observations become available. An alternative would be to use the learned coefficients and manually make predictions. This requires that the history of 43 prior observations be kept and that the coefficients be retrieved from the model and used in the regression equation to come up with new forecasts.

The coefficients are provided in an array with the intercept term followed by the coefficients for each lag variable starting at t-1 to t-n. We simply need to use them in the right order on the history of observations, as follows:

        yhat = b0 + b1*X1 + b2*X2 + ... + bn*Xn
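
The equation can be sketched as a small helper. The coefficients and history below are hypothetical values chosen so the arithmetic is easy to follow; X1 is the most recent observation (t-1), X2 the one before it, and so on.

```python
import numpy as np

def predict_next(coef, history):
    """One-step forecast.

    coef is [b0, b1, ..., bn]; history is in time order (oldest first).
    """
    window = len(coef) - 1
    # Reverse the last `window` observations so that X1 = y[t-1], X2 = y[t-2], ...
    lags = np.asarray(history[-window:], dtype=float)[::-1]
    return coef[0] + float(np.dot(coef[1:], lags))

# Hypothetical coefficients and observations.
coef = np.array([1.0, 0.5, 0.25])
history = [2.0, 4.0, 8.0]

yhat = predict_next(coef, history)  # 1.0 + 0.5*8.0 + 0.25*4.0
print(yhat)
```

Getting the order of the lags right matters: pairing b1 with the oldest observation instead of the newest gives a different, incorrect forecast.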

In [None]:
from pandas import Series
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AR
from sklearn.metrics import mean_squared_error
# split dataset

ts_values = ts.values
train, test = ts_values[1:len(ts_values)-16], ts_values[len(ts_values)-16:]


# train autoregression
model = AR(train)
model_fit = model.fit()
window = model_fit.k_ar
coef = model_fit.params

# walk forward over time steps in test
history = train[len(train)-window:]
history = [history[i] for i in range(len(history))]
predictions = list()
for t in range(len(test)):
	length = len(history)
	lag = [history[i] for i in range(length-window,length)]
	yhat = coef[0]
	for d in range(window):
		yhat += coef[d+1] * lag[window-d-1]
	obs = test[t]
	predictions.append(yhat)
	history.append(obs)
	print('predicted=%f, expected=%f' % (yhat, obs))
error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)
# plot
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

There is now a small improvement in the forecast, as the red line is closer to the blue line.