Potential of daily wholesale data (2005-2014) for prediction vs. daily retail (2009-2013) #26

halccw · 2014-05-08T23:07:20Z

Before I dig into prediction, I'd like to share and discuss some thoughts.

We have wholesale daily (2005-2014) and retail daily (2009-2013) datasets.

1. Include a few very good wholesale daily series in the prediction goals

The wholesale daily dataset is sparse, but it contains some very good series with more than 80-90% valid data over 10 years, which also appear very volatile and periodic. Although they are only a tiny portion of the whole picture, I suggest we still make good use of them to produce individual predictions.

Pre-interpolation graphs per region (zoom in or click a graph to see it more clearly):

Uttar Pradesh
Apple and onion appear volatile and periodic, but we should discard the rice here, since its price is very stable.

West Bengal
Observe the periodic clustering of high volatility.

Gujarat
Super volatile potato.

NCT of Delhi
Wheat price

Some more to come tomorrow.

mstefanro · 2014-05-08T23:21:38Z

Do you think we should just pick the K best time-series and attempt to predict those? This will save us from having to care about subproducts or cities, since we are simply predicting "things for which there is data", rather than trying to uniformly predict the same things for each region.

On 05/09/2014 01:07 AM, chingchia wrote:

Before I dig into prediction, share and discuss some thoughts.
    1. Include a few very good /wholesale daily/ series into
    prediction goals
The wholesale daily dataset is sparse, but we have some very good
series with more than 80%~90% of valid data in over 10 years which
also appear very volatile and periodic. Although they are only tiny
portions of the whole picture, I suggest we could still make good use
of them to produce individual predictions.

Pre-interpolation graphs per region (zoom in or click it to see
clearer graphs):

Uttar Pradesh
Apple and onion appear volatile and periodic, but we should discard
the rice here, since its price is very stable.
1
https://cloud.githubusercontent.com/assets/4166714/2922388/f20e8120-d700-11e3-9566-91cf3018b245.png

West Bengal
Observe the periodic clustering of high volatility.
2
https://cloud.githubusercontent.com/assets/4166714/2922391/f23a6f9c-d700-11e3-89e2-944bd69fcde7.png

Gujarat
Super volatile potato.
3
https://cloud.githubusercontent.com/assets/4166714/2922390/f2379506-d700-11e3-876e-54fc72c228f7.png

NCT of Delhi
4
https://cloud.githubusercontent.com/assets/4166714/2922389/f233e910-d700-11e3-8bd9-d1220570c72c.png

Some more to come tomorrow.


f4bD3v · 2014-05-09T09:33:46Z

@mstefanro, I would say that is the way to go given the time constraints and the quality of the data. We can choose specific series and additionally try to feed in prices in neighbouring regions, social media indicators and weather data with a researched offset. @ChingChia Are there more of these series in the wholesale data? Should we help check the data to filter out good series, or is the number very limited?
What do you think of plotting the daily or weekly retail data for the same commodities and trying to infer regions after interpolating the good wholesale series with a cubic spline?

For Delhi, is the time series also potato?

Considering frequency of consumption, potato, onion and apple are good choices. It would also be nice to find good series for rice, wheat and lentils.

halccw · 2014-05-09T13:07:25Z

@Fabbrix
For the wholesale daily dataset, unfortunately these are the only series that have >80% valid data. For the NCT of Delhi graph, it's wheat.

Please check this table to get a sense of the data availability of the wholesale dataset. Each cell gives the best valid-data rate for each (product, subproduct) in each region (note that 0.9 = 90%). An empty cell means that there is no series with more than 60% valid data. From the table, we have a few good individual series of rice and wheat (shown in the previous graphs), and some decent ones with about 70%-80% valid data.

A table showing data availability for wholesale daily:
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/best_non_na_0.4.csv

The same table for retail daily:
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-retail/best_non_na_0.4.csv
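As a rough illustration of what each cell in these tables represents, here is a minimal sketch of the valid-data rate and the "best rate per (region, product, subproduct)" lookup. It is not the script that produced the CSV files; the function names, the dictionary layout and the toy cities/subproducts ("City A", "Sub X") are assumptions made up for the example.

```python
import math
from collections import defaultdict

CUTOFF = 0.6  # cells whose best rate is not above this stay empty


def valid_data_rate(prices):
    """Fraction of observations that are not missing (NaN)."""
    if not prices:
        return 0.0
    valid = sum(1 for p in prices if not math.isnan(p))
    return valid / len(prices)


def best_rates(series, cutoff=CUTOFF):
    """Best valid-data rate per (region, product, subproduct).

    `series` maps (region, city, product, subproduct) -> list of daily prices.
    Keys whose best rate is not above `cutoff` are omitted (an empty cell).
    """
    best = defaultdict(float)
    for (region, _city, product, subproduct), prices in series.items():
        key = (region, product, subproduct)
        best[key] = max(best[key], valid_data_rate(prices))
    return {k: round(r, 2) for k, r in best.items() if r > cutoff}


nan = float("nan")
# Invented toy data, only to show the shape of the computation.
toy = {
    ("West Bengal", "City A", "Potato", "Sub X"): [5.0, nan, 5.5, 6.0, 6.2],
    ("West Bengal", "City B", "Potato", "Sub X"): [5.1, nan, nan, 6.1, nan],
    ("Gujarat", "City C", "Rice", "Sub Y"): [nan, nan, nan, 20.0, nan],
}
table = best_rates(toy)
```

Here only the West Bengal potato entry survives, because the Gujarat series has too few valid observations to pass the cutoff.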

I will add some more graphs of the retail daily later.

f4bD3v · 2014-05-09T13:28:48Z

@ChingChia
Approach to build near-complete series:

Lower the cut-off rate.
Plot series for marketplaces in the largest cities(?) for regions that satisfy the cut-off rate. One plot per region.
Check whether the plotted time series are complementary. If so, merge them into one, or infer missing values in a series from the others.
Goal: get time series with near 100% support.
Use these near-complete wholesale series to infer the relationship between wholesale and daily retail prices and compute missing values in daily retail! (Include inflation in the computation.)
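The "complementary series" check in the steps above could be sketched as follows. This is only one possible reading, under the assumption that two series are aligned on the same dates and that gaps in one are simply filled from the other; the helper names `support` and `merge_complementary` are hypothetical.

```python
import math


def is_valid(price):
    """True when the observation is present (not NaN)."""
    return not math.isnan(price)


def support(prices):
    """Share of dates with a valid observation."""
    return sum(1 for p in prices if is_valid(p)) / len(prices)


def merge_complementary(primary, secondary):
    """Keep `primary` where it has data, otherwise fall back to `secondary`.

    Both lists must cover the same dates in the same order.
    """
    if len(primary) != len(secondary):
        raise ValueError("series must be aligned on the same dates")
    return [a if is_valid(a) else b for a, b in zip(primary, secondary)]


nan = float("nan")
market_a = [10.0, nan, 12.0, nan]  # invented values
market_b = [nan, 11.0, nan, nan]
merged = merge_complementary(market_a, market_b)
gain = support(merged) - max(support(market_a), support(market_b))
```

A large `gain` suggests the two series really are complementary. Plain gap filling ignores price-level differences between markets, so inferring the missing values (for example through a fitted offset or ratio) may be the safer variant.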

halccw · 2014-05-10T00:28:16Z

wholesale daily regional plots

https://www.dropbox.com/s/25vd8fg5cznqqap/wholesale_daily_regional_plots_0.6.zip

The cutoff rate is lowered to 60%
11 states in total: 'Andhra Pradesh', 'Gujarat', 'Jharkhand', 'Karnataka', 'Madhya Pradesh', 'Maharashtra', 'NCT of Delhi', 'Orissa', 'Punjab', 'Uttar Pradesh', 'West Bengal'
5 products included: 'Rice','Wheat','Apple','Potato','Onion'
Across all regions and products, 24 plots are generated
format of legend: (state, city, product, subproduct, valid-data-rate), e.g. (Andhra Pradesh, Chirala, Rice, B P T, 0.71)

halccw · 2014-05-10T07:43:54Z

Wholesale daily product plots

The cutoff rate is 60%.
For each product, I select the series with the highest data availability in each region, to keep the graph readable
format of legend is the same as above: (state, city, product, subproduct, valid-data-rate), e.g. (Andhra Pradesh, Chirala, Rice, B P T, 0.71)

Rice

Wheat

Apple

Potato

Onion

f4bD3v · 2014-05-10T08:28:42Z

Usability review of selected wholesale series:
Maharashtra: Onion
NCT of Delhi: Potato, Wheat x2
Orissa: Wheat
Uttar Pradesh: Apple, Onion (merge all series?), Potato (merge series by averaging + noise?), Rice coarse vs. Rice fine?, Wheat (try merging)
Gujarat (can't exactly make out the series for subproducts): Wheat
Jharkhand (Ranchi): Fine Rice
West Bengal: Potato (all series match well) => we could build a Gaussian process out of them, Rice fine

halccw · 2014-05-10T09:35:15Z

The complete bundle of plots and tables for wholesale and retail, daily and weekly

including:

Datasets:

Wholesale daily
Retail daily
Wholesale weekly (downsampled from daily)
Retail weekly (downsampled from daily)

Selected products = ['Rice','Wheat','Apple','Potato','Onion']

'per_region/': one plot for each selected product for each region
'per_products/': one plot for each product
'per_products_regional_best/': one plot for each product. one best series per region.
'num_series.csv': counts of series above the cutoff rate per region
'best_non_na.csv': best valid-data-rate per region

link: https://dl.dropboxusercontent.com/u/29566584/wholesale_retail_daily_weekly.zip
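For readers wondering how the weekly datasets relate to the daily ones, here is a minimal sketch of one way to downsample. The aggregation rule is an assumption (mean of the valid daily prices within each ISO week); the thread does not state which rule was actually used, and `downsample_weekly` is a hypothetical name.

```python
import math
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def downsample_weekly(daily):
    """Average valid daily prices per ISO week.

    `daily` maps datetime.date -> price (NaN when missing).
    Returns {(iso_year, iso_week): mean price}, NaN for weeks with no data.
    """
    buckets = defaultdict(list)
    for day, price in daily.items():
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        buckets[week]  # make sure the week exists even if all days are missing
        if not math.isnan(price):
            buckets[week].append(price)
    return {
        week: (mean(values) if values else float("nan"))
        for week, values in sorted(buckets.items())
    }


nan = float("nan")
start = date(2014, 5, 5)  # a Monday
# Invented prices: a sparse first week, then a week with no data at all.
prices = [10.0, nan, 12.0, nan, nan, nan, 14.0] + [nan] * 7
daily = {start + timedelta(days=i): p for i, p in enumerate(prices)}
weekly = downsample_weekly(daily)
```

Averaging only the valid days means a week with a single observation still gets a value, which raises the valid-data rate of the weekly series but can also let one bad reading dominate a week.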

f4bD3v · 2014-05-10T15:50:24Z

Daily retail:
Maharashtra: Onion
NCT of Delhi: Onion, Potato

The regional-best plot of onion seems to show a general country-wide pattern, while the standard deviation for rice and wheat stays more or less stable over the period with increasing prices (we could try to match this to inflation). But maybe we're introducing a bias by selecting the regional best.

Weekly:
Karnataka: Potato looks very nice
Maharashtra: Onion, Potato, Rice, Wheat
NCT of Delhi: Onion, Potato
Orissa: Onion, Potato
Rajasthan: Onion
Tamil Nadu: Potato
Uttar Pradesh: Potato
West Bengal: Onion, Potato, Rice
For most of the others, bad data collection is evident.

The price-per-product plots are very nice. For the weekly data they show that the onion price is very volatile but similar across regions, while the prices for rice and wheat are less volatile but vary greatly across regions. Potato is also very volatile, with some difference between regions. Also, inflation seems to manifest itself more in the price of rice and wheat than in the price of potato and onion.

Empirically motivate the choice of granularity: Time Series analysis of volatility granularity?
Compute average price per region with standard deviation => compute average national price with standard deviation.
How should we proceed to compute average national prices? By region?
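One possible answer to the "by region?" question, sketched under the assumption that every region counts equally: average within each region first, then average the regional means. Weighting regions by population or traded volume would be an alternative that this sketch does not cover. The function names and the prices are invented for illustration.

```python
from statistics import mean, stdev


def spread(values):
    """Sample standard deviation, or 0.0 when there is a single value."""
    return stdev(values) if len(values) > 1 else 0.0


def regional_stats(prices_by_region):
    """region -> (mean, std) over that region's valid prices for one period."""
    return {
        region: (mean(prices), spread(prices))
        for region, prices in prices_by_region.items()
        if prices  # skip regions without any valid price
    }


def national_stats(regional):
    """Unweighted mean and std of the regional means."""
    means = [m for m, _ in regional.values()]
    return mean(means), spread(means)


# Invented prices for one product and one period.
prices_by_region = {
    "Maharashtra": [10.0, 12.0],
    "NCT of Delhi": [14.0, 16.0],
    "West Bengal": [20.0],
}
regional = regional_stats(prices_by_region)
national_mean, national_std = national_stats(regional)
```

Averaging regional means rather than pooling all series keeps regions with many markets from dominating the national figure.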

For the network, I think it is not too important with exactly which offset we feed in the weather data, because the reservoir has a memory property.

halccw · 2014-05-19T08:25:16Z

After looking at every series above a 60% valid-data rate in the 3 datasets, I realized that:

Retail weekly is useless for now. Even after a simple spike-removal heuristic, there are too many spikes left.
Retail daily is useless when we have the wholesale daily dataset. Many of its series look strange. The usable ones have patterns very similar to those in wholesale daily. Its time span is also much shorter than that of wholesale daily.
Wholesale daily is the most useful one. Its series show good dynamics and have fewer spikes.
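The spike-removal heuristic itself is not described in this thread. As an illustration only, a simple variant could compare each point with the median of its neighbours and blank out points that deviate too much; the window size and the 50% threshold below are arbitrary assumptions, and `remove_spikes` is a hypothetical name.

```python
import math
from statistics import median


def remove_spikes(prices, window=3, max_rel_dev=0.5):
    """Replace points far from the median of their neighbours with NaN.

    A point is treated as a spike when it differs from the median of up to
    `window` valid neighbours on each side by more than `max_rel_dev`
    (relative to that median).
    """
    cleaned = list(prices)
    for i, price in enumerate(prices):
        if math.isnan(price):
            continue
        around = prices[max(0, i - window):i] + prices[i + 1:i + 1 + window]
        neighbours = [q for q in around if not math.isnan(q)]
        if not neighbours:
            continue
        reference = median(neighbours)
        if reference > 0 and abs(price - reference) / reference > max_rel_dev:
            cleaned[i] = float("nan")
    return cleaned


nan = float("nan")
raw = [10.0, 10.5, 50.0, 11.0, nan, 11.5, 12.0, 11.8]  # invented values
cleaned = remove_spikes(raw)
```

Only the isolated 50.0 is removed here. A heuristic like this struggles when spikes arrive in clusters, which may be part of why too many spikes remain in retail weekly.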

2. In Wholesale daily, merge series to construct an extended dataset with a more uniform profile over products and regions

Merge series with more than 60% of valid data of the same product within each region by averaging, to get (Uttar Pradesh, Potato), (West Bengal, Rice), etc.

plots per region
legend = (region, product)
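The merging step could be sketched like this, assuming all series are aligned on the same dates and that "averaging" means the mean of whatever valid values exist on each day. The names and the toy cities/subproducts are placeholders, not taken from the dataset.

```python
import math
from collections import defaultdict
from statistics import mean

CUTOFF = 0.6  # only series with more than 60% valid data are merged


def valid_rate(prices):
    """Fraction of non-missing observations."""
    return sum(1 for p in prices if not math.isnan(p)) / len(prices)


def nan_mean(values):
    """Mean of the valid values, NaN when none are valid."""
    valid = [v for v in values if not math.isnan(v)]
    return mean(valid) if valid else float("nan")


def merge_by_region_product(series, cutoff=CUTOFF):
    """Average qualifying series of the same product within each region.

    `series` maps (region, city, product, subproduct) -> aligned daily prices.
    Returns {(region, product): merged daily prices}.
    """
    groups = defaultdict(list)
    for (region, _city, product, _subproduct), prices in series.items():
        if valid_rate(prices) > cutoff:
            groups[(region, product)].append(prices)
    return {
        key: [nan_mean(day) for day in zip(*members)]
        for key, members in groups.items()
    }


nan = float("nan")
# Invented toy data.
series = {
    ("Uttar Pradesh", "City A", "Potato", "Sub X"): [8.0, nan, 9.0, 10.0, 10.0],
    ("Uttar Pradesh", "City B", "Potato", "Sub Y"): [10.0, 9.0, nan, 12.0, 11.0],
    ("Uttar Pradesh", "City C", "Potato", "Sub X"): [nan, nan, nan, 30.0, nan],
}
merged = merge_by_region_product(series)
```

The third series falls below the cutoff and is ignored, so its outlying value does not leak into the merged (Uttar Pradesh, Potato) series. Note that the merged level can jump on days where only one of the member series has data.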

f4bD3v changed the title ~~Some thoughts of possible usage of the daily datasets in prediction~~ Potential of daily wholesale data (2005-2014) for prediction vs. daily retail (2009-2013) May 9, 2014