# Time Series Prediction on the Food Commodities Prices in Bandung
**Author:** Ferdinand Lanvino<br>
**Start Date:** 8th June, 2020<br>
**Purpose:** Extract meaningful information from a prediction model and data analysis of food commodities<br>
**Objective:** Build a prediction model of food commodities prices using various time-series forecasting methods, based on Python.<br>

Table of Contents:
1. [Q&A](#q&a)
2. [Preparing the Project Environment](#preparing-the-project-environment)


## 1. Q&A <a name="q&a"></a>
**1. What is Time series analysis?**  
A. A time series is a series of observations taken at specified, usually equal, time intervals. Analyzing the series helps us predict future values from previously observed values. In the simplest case a time series has only 2 variables: time & the variable we want to forecast.  

  
**2. Why & where Time Series is used?**  
A. Time series data can be analyzed to extract meaningful statistics and other characteristics. It is used in at least these 4 scenarios:  
    a) Business Forecasting  
    b) Understand past behavior  
    c) Plan the future  
    d) Evaluate current accomplishment  
  
**3. When shouldn't we use Time Series Analysis?**  
A. Time series analysis is unnecessary in at least the following 2 cases:  
    a) The dependent variable (y), which is supposed to vary with time, is constant. E.g. y = f(x) = 4, a line parallel to the x-axis (time), always stays the same.  
    b) The dependent variable (y) takes values given by a known mathematical function, e.g. sin(x), log(x) or a polynomial. The value at any time can then be computed directly from the function, so no forecasting is needed.  
  
**4. What are the components of Time Series?**  
A. There are 4 components:  
    a) Trend - upward or downward movement of the data over a long period of time. E.g. appreciation of the dollar against the rupee.  
    b) Seasonality - seasonal variation. E.g. ice cream sales increase in summer.  
    c) Noise or Irregularity - spikes & troughs at random intervals.  
    d) Cyclicity - behavior that repeats itself after long intervals of time, such as months or years.  
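
To make the components concrete, here is a minimal sketch that builds a synthetic series from a trend, a seasonal pattern and noise, then recovers the trend with a moving average. The numbers (slope 0.5, period 12, noise scale 1) are illustrative assumptions and are not taken from the commodity dataset. Cyclicity is left out to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n, period = 120, 12
t = np.arange(n)

trend = 0.5 * t                                   # steady upward movement
seasonal = 10 * np.sin(2 * np.pi * t / period)    # pattern repeating every 12 steps
noise = rng.normal(scale=1.0, size=n)             # irregular component
series = trend + seasonal + noise

# A moving average spanning exactly one seasonal period averages the
# seasonal pattern out, leaving a smoothed estimate of the trend.
window = np.ones(period) / period
trend_estimate = np.convolve(series, window, mode="valid")

# Slope of the recovered trend; it should be close to the 0.5 used above.
slope = np.polyfit(np.arange(trend_estimate.size), trend_estimate, 1)[0]
print(round(slope, 2))
```

The window length has to match the seasonal period for the seasonal term to cancel; with a different window some of the seasonal pattern leaks into the trend estimate.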
    
**5. What is Stationarity?**    
A. Most statistical time series models assume that the series is stationary, so the series should be made stationary before such a model is applied. Stationary means that, over different time periods,  
    a) It should have constant mean.  
    b) It should have constant variance or standard deviation.  
    c) Auto-covariance should not depend on time.  

Trend & seasonality are two reasons why a time series may not be stationary, and hence they need to be corrected for.
    
**6. Why does Time Series(TS) need to be stationary?**  
A. It is because of the following reasons:  
    a) If a TS shows a particular behavior over one time interval, then there is a high probability that it will show the same behavior over a different interval, provided the TS is stationary. This helps in forecasting accurately.  
    b) Theories & mathematical formulas are more mature & easier to apply for a TS which is stationary.  

**7. Tests to check if a series is stationary or not**  
A. There are 2 ways to check for stationarity of a TS:  
    a) Rolling Statistics - Plot the moving average or moving standard deviation to see if it varies with time. It's a visual technique.  
    b) ADF (ADCF) Test - The Augmented Dickey–Fuller test gives us several values that help in identifying stationarity. The null hypothesis says that the TS is non-stationary. The result comprises a **test statistic** & **critical values** at several significance levels. If the test statistic is less than the critical value, we can reject the null hypothesis & say that the series is stationary. The test also gives us a **p-value**: the lower the p-value, the stronger the evidence against the null hypothesis.
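
The rolling-statistics check can be sketched numerically as well as visually. The example below uses two synthetic series (assumptions for illustration, not the commodity data) and a hypothetical helper, `rolling_mean_range`, that measures how far the rolling mean wanders. Only the rolling mean is shown; the rolling standard deviation works the same way with `.std()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
noise = pd.Series(rng.normal(size=n))        # stationary: constant mean and variance
trending = noise + 0.05 * np.arange(n)       # non-stationary: the mean drifts upward

def rolling_mean_range(series, window=50):
    """Spread (max - min) of the rolling mean; a large spread hints at a drifting mean."""
    rolling_mean = series.rolling(window).mean().dropna()
    return rolling_mean.max() - rolling_mean.min()

print("stationary:", rolling_mean_range(noise))
print("trending:  ", rolling_mean_range(trending))
```

For the stationary series the rolling mean stays in a narrow band, while for the trending series it climbs steadily, which is what the plot would show as a rising line.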
    
**8. What is ARIMA model?**      
A. ARIMA (Auto Regressive Integrated Moving Average) combines 2 models, AR (Auto Regressive) & MA (Moving Average), with a differencing step. It has 3 hyperparameters: p (number of autoregressive lags), d (order of differencing) & q (number of moving average lags), which come from the AR, I & MA components respectively. The AR part models the correlation between previous & current time periods. The MA part models the dependence on past forecast errors, which smooths out the noise. The I part differences the series d times to make it stationary before the AR & MA parts are applied. 
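
As a rough illustration of how the I and AR parts fit together, the sketch below simulates a series whose first differences follow an AR(1) process, differences it once (d = 1), and estimates the single AR coefficient by least squares. This is a hand-rolled ARIMA(1, 1, 0) under stated assumptions (synthetic data, a true coefficient of 0.6, no MA term), not a replacement for a library implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
true_phi, n = 0.6, 2000

# Simulate the series: its first differences follow an AR(1) process.
diffs = np.zeros(n)
shocks = rng.normal(size=n)
for t in range(1, n):
    diffs[t] = true_phi * diffs[t - 1] + shocks[t]
series = 100 + np.cumsum(diffs)

# I step (d = 1): difference once to remove the stochastic trend.
d1 = np.diff(series)

# AR step (p = 1): least-squares estimate of phi in d1[t] = phi * d1[t-1] + e[t].
phi_hat = np.dot(d1[1:], d1[:-1]) / np.dot(d1[:-1], d1[:-1])

# One-step forecast: predict the next difference, then undo the differencing.
next_diff = phi_hat * d1[-1]
forecast = series[-1] + next_diff

print("estimated phi:", round(phi_hat, 2))
print("last value:", round(series[-1], 2), "forecast:", round(forecast, 2))
```

The MA part is omitted because its coefficients cannot be obtained from a single least-squares step; fitting them requires an iterative procedure.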

**9. How to find the values of p & q for ARIMA?**  
A. We use the ACF (Auto Correlation Function) & PACF (Partial Auto Correlation Function) plots. For each plot, find the first lag on the x-axis at which the curve drops to 0 on the y-axis.  
From PACF (at y=0), get p  
From ACF (at y=0), get q  
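
The ACF behind such a plot can be computed directly. The sketch below is a minimal NumPy version run on a synthetic MA(1) series; the helper names `sample_acf` and `first_lag_inside_band` are hypothetical. Treating "drops to 0" as "falls inside the approximate 95% band of ±1.96/√n" is an assumption made here, since a sample autocorrelation is rarely exactly zero.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: x.size - k], x[k:]) / denom for k in range(max_lag + 1)])

def first_lag_inside_band(acf, n):
    """First lag >= 1 whose autocorrelation lies inside +-1.96/sqrt(n), or None."""
    bound = 1.96 / np.sqrt(n)
    for lag in range(1, len(acf)):
        if abs(acf[lag]) < bound:
            return lag
    return None

# Synthetic MA(1) series: x[t] = e[t] + 0.8 * e[t-1]
rng = np.random.default_rng(2)
e = rng.normal(size=1001)
x = e[1:] + 0.8 * e[:-1]

acf = sample_acf(x, max_lag=10)
print(np.round(acf[:4], 2))
print("first lag inside the band:", first_lag_inside_band(acf, x.size))
```

For an MA(1) process only the lag-1 autocorrelation is non-zero in theory, so the ACF is expected to cut off right after lag 1.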

**10. What is the ADF (ADCF) test?**  
A. In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis that a unit root is present in a time series sample. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity. It is an augmented version of the Dickey–Fuller test for a larger and more complicated set of time series models.

The augmented Dickey–Fuller (ADF) statistic, used in the test, is a negative number. The more negative it is, the stronger the rejection of the hypothesis that there is a unit root at some level of confidence.

The p-value (0 <= p <= 1) should be as low as possible. To reject the null hypothesis at a given significance level, the test statistic must be lower (more negative) than the critical value for that level.
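
The idea behind the test statistic can be sketched with the plain (non-augmented) Dickey–Fuller regression, which regresses the first difference on the lagged level and looks at the t-statistic of the lag coefficient. The helper `dickey_fuller_stat` is hypothetical, the series are synthetic, and -2.86 is the approximate large-sample 5% critical value for the version with a constant and no trend. In practice a library routine such as `adfuller` from `statsmodels.tsa.stattools` would normally be used, since it also handles lagged differences and p-values.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of gamma in  dy[t] = c + gamma * y[t-1] + e[t]  (no lagged differences)."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(dy.size), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (dy.size - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # covariance of the estimates
    return beta[1] / np.sqrt(cov[1, 1])

CRITICAL_5PCT = -2.86  # approximate asymptotic 5% value (constant, no trend)

rng = np.random.default_rng(3)
white_noise = rng.normal(size=500)       # stationary
random_walk = np.cumsum(white_noise)     # has a unit root

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    stat = dickey_fuller_stat(series)
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name}: statistic={stat:.2f} -> {verdict}")
```

The white-noise series yields a strongly negative statistic, while the random walk yields one much closer to zero, matching the rule that a more negative statistic is stronger evidence against a unit root.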

**11. What is Exponential Smoothing?**  
A. *Exponential smoothing* is a rule of thumb technique for smoothing time series data using the exponential window function. Whereas in the simple moving average the past observations are weighted equally, exponential functions are used to assign exponentially decreasing weights over time. It is an easily learned and easily applied procedure for making some determination based on prior assumptions by the user, such as seasonality. Exponential smoothing is often used for analysis of time-series data.

The raw data sequence is often represented by ${x_{t}}$ beginning at time $t=0$, and the output of the exponential smoothing algorithm is commonly written as ${s_{t}}$, which may be regarded as a best estimate of what the next value of $x$ will be. When the sequence of observations begins at time $t=0$, the simplest form of exponential smoothing is given by the formulas:  

$s_{0} = x_{0}$  
$s_{t} = α x_{t} + (1-α) s_{t-1}$  , $t>0$  

where $α$ is the smoothing factor, and $0<α<1$.
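
The recursion translates almost line for line into code. The sketch below is a plain-Python version; the function name and the three sample values are made up for illustration.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: s_0 = x_0, s_t = alpha*x_t + (1 - alpha)*s_{t-1}."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    smoothed = [values[0]]                      # s_0 = x_0 (values must be non-empty)
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

prices = [10.0, 12.0, 14.0]
print(exponential_smoothing(prices, alpha=0.5))  # [10.0, 11.0, 12.5]
```

With $α = 0.5$ each smoothed value is the midpoint between the new observation and the previous smoothed value. A larger $α$ follows the data more closely, and a smaller $α$ smooths more heavily.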

**12. What is Exponential decay?**  
A. A quantity is subject to exponential decay if it decreases at a rate proportional to its current value. Symbolically, this process can be expressed by the following differential equation, where N is the quantity and λ (lambda) is a positive rate called the exponential decay constant:

$dN/dt = -λN$

The solution to this equation is:  
$N(t) = N_{0} e^{-λt}$  

where $N(t)$ is the quantity at time $t$, and $N_{0} = N(0)$ is the initial quantity, i.e. the quantity at time $t = 0$.  

**Half Life** is the time required for the decaying quantity to fall to one half of its initial value. It is denoted by $t_{1/2}$. The half-life can be written in terms of the decay constant as:  

$t_{1/2} = ln(2)/λ$  
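
As a quick numerical check of the two formulas, the sketch below evaluates the decay at exactly one half-life and confirms that half of the initial quantity remains. The decay constant 0.1 and the initial quantity 100 are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def decayed_quantity(n0, decay_constant, t):
    """N(t) = N0 * exp(-lambda * t)."""
    return n0 * math.exp(-decay_constant * t)

def half_life(decay_constant):
    """t_half = ln(2) / lambda."""
    return math.log(2) / decay_constant

lam = 0.1
t_half = half_life(lam)
remaining = decayed_quantity(100.0, lam, t_half)

print("half-life:", t_half)
print("quantity after one half-life:", remaining)
```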

## 2. Preparing the Project Environment <a name="preparing-the-project-environment"></a>

Cloning the dataset


In [20]:
# Clone commodity-predictor repo.
!git clone https://github.com/ferdinand-lanvino/commodity-predictor.git
%cd commodity-predictor
!ls

Cloning into 'commodity-predictor'...
remote: Enumerating objects: 77, done.

Pull the latest update from GitHub

In [21]:
!git pull

Already up to date.


## 3. Preprocessing the dataset

Importing the libraries

In [0]:
# basic packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import datetime # manipulating date formats

# visualization
import matplotlib.pyplot as plt # basic plotting
import seaborn as sns # for prettier plots


Importing the dataset

In [27]:
dataset = pd.read_csv('Dataset/Unprocessed/combined.csv')
dataset

|  | komoditas | tanggal | harga |
|---|---|---|---|
| 0 | Beras | 25/07/2016 | 10900.0 |
| 1 | Beras | 26/07/2016 | 10950.0 |
| 2 | Beras | 27/07/2016 | 10950.0 |
| 3 | Beras | 28/07/2016 | 10950.0 |
| 4 | Beras | 29/07/2016 | 10950.0 |
| ... | ... | ... | ... |
| 26227 | Gula Pasir Lokal | 23/12/2019 | 13000.0 |
| 26228 | Gula Pasir Lokal | 26/12/2019 | 13000.0 |
| 26229 | Gula Pasir Lokal | 27/12/2019 | 13000.0 |
| 26230 | Gula Pasir Lokal | 30/12/2019 | 13000.0 |


In [33]:
# Date formatting: convert 'tanggal' from string to datetime
dataset['tanggal'] = pd.to_datetime(dataset['tanggal'], format='%d/%m/%Y')
# Check the data (info() prints its report and returns None, hence the "None" in the output)
print(dataset.info())
# Use the date as the index
indexedDataset = dataset.set_index(['tanggal'])
indexedDataset.head(5)
print(indexedDataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26232 entries, 0 to 26231
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   komoditas  26232 non-null  object        
 1   tanggal    26232 non-null  datetime64[ns]
 2   harga      26142 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 614.9+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26232 entries, 2016-07-25 to 2019-12-31
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   komoditas  26232 non-null  object 
 1   harga      26142 non-null  float64
dtypes: float64(1), object(1)
memory usage: 614.8+ KB
None


In [36]:
# Per-commodity summary: first/last observation date and min/max price
# (this groups by commodity, not by month)
monthly_data = dataset.groupby("komoditas")[["tanggal", "harga"]].agg(["min", "max"])
monthly_data

  


| komoditas | tanggal (min) | tanggal (max) | harga (min) | harga (max) |
|---|---|---|---|---|
| Bawang Merah | 2016-07-25 | 2019-12-31 | 22500.0 | 54500.0 |
| Bawang Merah Ukuran Sedang | 2016-07-25 | 2019-12-31 | 22500.0 | 54500.0 |
| Bawang Putih | 2016-07-25 | 2019-12-31 | 23500.0 | 87500.0 |
| Bawang Putih Ukuran Sedang | 2017-01-03 | 2019-12-31 | 23500.0 | 87500.0 |
| Bawang Putih Ukuran Sedang (kg) | 2016-07-25 | 2016-12-30 | 36000.0 | 44000.0 |
| Beras | 2016-07-25 | 2019-12-31 | 10900.0 | 12650.0 |
| Beras Kualitas Bawah I | 2016-07-25 | 2019-12-31 | 9750.0 | 11900.0 |
| Beras Kualitas Bawah II | 2016-07-25 | 2019-12-31 | 9000.0 | 11500.0 |
| Beras Kualitas Medium I | 2016-07-25 | 2019-12-31 | 11400.0 | 14500.0 |
| Beras Kualitas Medium II | 2016-07-25 | 2019-12-31 | 10850.0 | 13650.0 |