## Data Collection

The dataset comes from the Humanitarian Data Exchange (HDX). It is the World Food Programme Price Database for the Philippines, which records food prices across the country. The data spans 2000 to 2023 and contains 184,714 entries (rows). The dataset is stored as `wfp_food_prices_phl.csv` in the root folder, and HDX also makes it publicly available at the following link:

Link: https://data.humdata.org/dataset/wfp-food-prices-for-philippines?force_layout=desktop.

We also manually added the yearly inflation rate for 2000 to 2023, taken from WorldData.info:

Link: https://www.worlddata.info/asia/philippines/inflation-rates.php
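
The notebook does not show how the inflation column was attached, so the following is only a minimal sketch of one way a yearly rate could be joined to dated price rows. The frames `prices` and `inflation` are hypothetical stand-ins, and the 2001 rate is a placeholder rather than a real figure (only the 2000 value of 3.98 appears in the dataset preview).

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the WFP price data
prices = pd.DataFrame({
    "date": ["1/15/00", "1/15/01", "6/15/01"],
    "commodity": ["Maize flour (yellow)", "Maize flour (yellow)", "Taro"],
    "price": [15.0, 16.5, 30.0],
})

# Hypothetical yearly inflation table; the 2001 value is a placeholder
inflation = pd.DataFrame({"year": [2000, 2001], "inflation": [3.98, 5.0]})

# Parse the date, derive the year, and attach the matching yearly rate
prices["year"] = pd.to_datetime(prices["date"], format="%m/%d/%y").dt.year
merged = prices.merge(inflation, on="year", how="left").drop(columns="year")
print(merged)
```

A left join keeps every price row even if a year were missing from the inflation table, in which case the rate would show up as NaN and be caught by a later null check.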

#### The dataset is imported as follows:

In [1]:
#import libraries
import pandas as pd
import numpy as np

#import the dataset
data = pd.read_csv('wfp_food_prices_phl.csv')
data.head()

Unnamed: 0,date,admin1,admin2,market,latitude,longitude,category,commodity,unit,priceflag,pricetype,currency,price,usdprice,inflation
0,1/15/00,National Capital region,Metropolitan Manila,Metro Manila,14.604167,120.982222,cereals and tubers,Maize flour (yellow),KG,actual,Retail,PHP,15.0,0.3717,3.98
1,1/15/00,National Capital region,Metropolitan Manila,Metro Manila,14.604167,120.982222,cereals and tubers,"Rice (milled, superior)",KG,actual,Retail,PHP,20.0,0.4957,3.98
2,1/15/00,National Capital region,Metropolitan Manila,Metro Manila,14.604167,120.982222,cereals and tubers,"Rice (milled, superior)",KG,actual,Wholesale,PHP,18.35,0.4548,3.98
3,1/15/00,National Capital region,Metropolitan Manila,Metro Manila,14.604167,120.982222,cereals and tubers,"Rice (regular, milled)",KG,actual,Retail,PHP,18.0,0.4461,3.98
4,1/15/00,National Capital region,Metropolitan Manila,Metro Manila,14.604167,120.982222,cereals and tubers,"Rice (regular, milled)",KG,actual,Wholesale,PHP,16.35,0.4052,3.98


#### Inspecting the columns of the dataset:

In [2]:
#check the columns of the dataset
print("Columns: ", data.columns)
#check the shape of the dataset
print("Data Shape: ",data.shape)

Columns:  Index(['date', 'admin1', 'admin2', 'market', 'latitude', 'longitude',
       'category', 'commodity', 'unit', 'priceflag', 'pricetype', 'currency',
       'price', 'usdprice', 'inflation'],
      dtype='object')
Data Shape:  (184714, 15)


#### The columns are the following:

1) date - The date of the data entry
2) admin1 - The region of the data entry
3) admin2 - The province of the data entry
4) market - The city of the data entry
5) latitude - The latitude of the market
6) longitude - The longitude of the market
7) category - The category of the food item
8) commodity - The food item
9) unit - The unit of the food item
10) priceflag - Whether the price is an actual, projected, or aggregate price
11) pricetype - The type of price (retail, wholesale, etc)
12) currency - The currency of the price
13) price - The price of the food item
14) usdprice - The price of the food item in USD
15) inflation - The yearly inflation rate


## Data preprocessing
The data was preprocessed as follows: the columns to use as features and target were identified, and the data was split into train and test sets.

#### Inspecting the datatype of each column:

In [3]:
#check the datatypes of each column
data.dtypes

date          object
admin1        object
admin2        object
market        object
latitude     float64
longitude    float64
category      object
commodity     object
unit          object
priceflag     object
pricetype     object
currency      object
price        float64
usdprice     float64
inflation    float64
dtype: object

#### Converting data into appropriate types

The features with object datatypes must be converted to their more appropriate datatypes. In this case, however, we will only convert (decompose) the <i>date</i> column into (<i>year, month</i>). For the other categorical columns (<i>commodity, pricetype</i>, etc.), we will just get dummy values in order to fit them to the model.

In [4]:
#Change the datatype of date and get the year and month
#Add a year and month column which will be used for our regression model
data['date'] = data['date'].astype('datetime64[ns]')
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data.head(2)

Unnamed: 0,date,admin1,admin2,market,latitude,longitude,category,commodity,unit,priceflag,pricetype,currency,price,usdprice,inflation,year,month
0,2000-01-15,National Capital region,Metropolitan Manila,Metro Manila,14.604167,120.982222,cereals and tubers,Maize flour (yellow),KG,actual,Retail,PHP,15.0,0.3717,3.98,2000,1
1,2000-01-15,National Capital region,Metropolitan Manila,Metro Manila,14.604167,120.982222,cereals and tubers,"Rice (milled, superior)",KG,actual,Retail,PHP,20.0,0.4957,3.98,2000,1


#### Removing invalid entries:
Next, the data was cleaned by removing invalid entries: rows with null values and rows with price == 0 were dropped.

In [5]:
#Drop all the null values and the rows with price == 0 
data.dropna(inplace= True)
data.drop(data.loc[data['price']==0].index, inplace=True)

#### Removing unnecessary columns:
The dataset was refined further by excluding columns that are either redundant or irrelevant to the predictive model, since including such features would only introduce noise during training. For example, <i>date</i> is redundant with the <i>month</i> and <i>year</i> columns, while <i>admin1, admin2, </i>and <i>market</i> are redundant with <i>latitude</i> and <i>longitude</i>. We chose latitude and longitude because they can also help estimate food prices in areas with no available data entries.

In [6]:
#Drop the unnecessary columns from the dataset
data = data.drop(['date', 'admin1','admin2','market','category', 'currency', 'unit'
        ,'usdprice'], axis='columns')
data.head(3)

Unnamed: 0,latitude,longitude,commodity,priceflag,pricetype,price,inflation,year,month
0,14.604167,120.982222,Maize flour (yellow),actual,Retail,15.0,3.98,2000,1
1,14.604167,120.982222,"Rice (milled, superior)",actual,Retail,20.0,3.98,2000,1
2,14.604167,120.982222,"Rice (milled, superior)",actual,Wholesale,18.35,3.98,2000,1


#### Converting categorical columns:
Some of the features in the dataset are categorical and cannot be properly represented (or evaluated) using integer label assignments within their columns, especially since the categories within the feature do not naturally have a relationship with each other (e.g., <i>commodity_banana</i> and <i>commodity_rice</i> under <i>commodity</i> are unrelated). Hence, the categories within each categorical feature (<i>commodity, priceflag,</i> and <i>pricetype</i>) are converted into their own columns using dummy values:

In [7]:
#Get dummy values for the categorical columns
data = pd.get_dummies(data=data)
data.head(2)

Unnamed: 0,latitude,longitude,price,inflation,year,month,commodity_Anchovies,commodity_Bananas (lakatan),commodity_Bananas (latundan),commodity_Bananas (saba),...,commodity_Sweet potatoes,commodity_Taro,commodity_Tomatoes,commodity_Water spinach,priceflag_actual,"priceflag_actual,aggregate",priceflag_aggregate,pricetype_Farm Gate,pricetype_Retail,pricetype_Wholesale
0,14.604167,120.982222,15.0,3.98,2000,1,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,14.604167,120.982222,20.0,3.98,2000,1,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0


#### Preparing the train and test sets
The data is randomly split into train and test sets using an <i>80-20</i> partition as follows:

In [8]:
#Divide the data into test and training sets
X = data.drop(['price'], axis='columns')
y = data['price'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=26)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
display(X_train)

(109930, 84)
(27483, 84)
(109930,)
(27483,)


Unnamed: 0,latitude,longitude,inflation,year,month,commodity_Anchovies,commodity_Bananas (lakatan),commodity_Bananas (latundan),commodity_Bananas (saba),"commodity_Beans (green, fresh)",...,commodity_Sweet potatoes,commodity_Taro,commodity_Tomatoes,commodity_Water spinach,priceflag_actual,"priceflag_actual,aggregate",priceflag_aggregate,pricetype_Farm Gate,pricetype_Retail,pricetype_Wholesale
15312,16.016667,120.233333,3.03,2012,11,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
51810,14.604167,120.982222,2.39,2020,9,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
34435,16.486093,121.146518,2.39,2020,5,0,0,0,0,1,...,0,0,0,0,1,0,0,0,1,0
35778,11.706772,122.370090,2.39,2020,5,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
49991,8.040911,123.799419,2.39,2020,8,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73408,13.146926,123.750464,3.93,2021,2,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
137475,10.667360,122.946930,5.80,2022,7,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
77256,10.132101,124.834680,3.93,2021,3,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
59971,13.137222,123.734444,2.39,2020,10,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0


#### Normalizing the data
In order to keep the feature ranges similar, the non-dummy columns were normalized using MinMaxScaler:

In [9]:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
# fit the scaler on the train set only, then apply the same scaling to the test set
num_cols = ['latitude','longitude','inflation','year','month']
X_train[num_cols] = sc.fit_transform(X_train[num_cols])
X_test[num_cols] = sc.transform(X_test[num_cols])
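
To make the scaling step concrete, here is a small sketch on made-up year values (the arrays `train` and `test` are hypothetical, not taken from the dataset). MinMaxScaler maps each value to (x - min) / (max - min) using the minimum and maximum seen during fitting, so test values outside the training range can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature train and test sets
train = np.array([[2000.0], [2010.0], [2020.0]])
test = np.array([[2005.0], [2023.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=2000, max=2020
test_scaled = scaler.transform(test)        # reuses the training min and max

# Manual check of the formula (x - min) / (max - min)
manual = (test - train.min()) / (train.max() - train.min())
print(train_scaled.ravel(), test_scaled.ravel(), manual.ravel())
```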

#### Summary
The dataset was preprocessed as follows:

1. The following columns were removed:

    1) <b>date</b> - the month and year columns extracted from it already carry this information.
    2) <b>admin1, admin2, market</b> - longitude and latitude already represent the precise location.
    3) <b>category</b> - this decreases the accuracy of the model because each category contains many commodities, all with different prices.
    4) <b>unit, currency, usdprice</b> - these columns are irrelevant to the model since unit is just the measurement, all prices are in Philippine pesos, and usdprice is just the converted price.

2. The following categorical columns were converted into indicator/dummy variables:
    1) <b>commodity</b> - 73 total categories
    2) <b>priceflag</b> - 3 total categories
    3) <b>pricetype</b> - 3 total categories

3. Non-dummy columns were normalized: latitude, longitude, inflation, year, month

##### The remaining features are then the independent variables used to predict the price.

## Model
### Performing linear regression on the dataset.
Since the aim is to predict a continuous value (price) from a given set of factors, a regression model would be the most appropriate for the dataset. 

Two linear regression models were tested (model reference: [sklearn](https://scikit-learn.org/stable/modules/linear_model.html)):
1. <b>sklearn's LinearRegression()</b> - ordinary least squares regression. This basic method finds the coefficients that minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear model.

2. <b>sklearn's RidgeCV()</b> - Ridge regression with cross-validation. This builds on the linear least squares model by adding a regularization penalty on the size of the coefficients. The regularization helps address multicollinearity (when multiple independent variables are highly correlated, which makes their individual effects hard to determine; this may help with tightly related variables such as commodity and location), so the method is generally more stable. Finally, the built-in cross-validation lets the model select the best alpha parameter from a set of candidate values.
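
For reference, the two objectives can be written side by side. The symbols are notation introduced here for illustration: $X$ is the feature matrix, $y$ the vector of observed prices, $w$ the coefficient vector, and $\alpha \ge 0$ the regularization strength that RidgeCV selects.

$$
\text{OLS:}\quad \min_{w}\ \lVert Xw - y \rVert_2^2
\qquad\qquad
\text{Ridge:}\quad \min_{w}\ \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

With $\alpha = 0$ the Ridge objective reduces to OLS, and larger values of $\alpha$ shrink the coefficients more strongly toward zero.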

#### Training and testing the two regression models:

In [10]:
#Train the model using the training set
from sklearn.linear_model import LinearRegression, RidgeCV
regressor1 = LinearRegression()
regressor2 = RidgeCV(alphas=np.logspace(1, 2, 50))

regressor1.fit(X_train, y_train)
regressor2.fit(X_train, y_train)

#Use the model on the testing set
test_predictions1 = regressor1.predict(X_test) # predictions using LinearRegression (OLS)
test_predictions2 = regressor2.predict(X_test) # predictions using RidgeCV

### Results.
The results of the two regression models in the following section were evaluated according to the following metrics: <b><i>MAE, MSE, RMSE, r<sup>2</sup></i></b>.
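
As a reminder of what these metrics measure, the sketch below computes each one by hand with NumPy and checks it against sklearn. It uses only the first five actual/predicted pairs displayed for LinearRegression in this notebook, so the numbers it prints describe those five rows and not the full test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# First five actual/predicted pairs from the LinearRegression output table
y_true = np.array([238.36, 40.00, 37.50, 42.77, 138.00])
y_pred = np.array([237.25, 49.25, 35.75, 24.25, 205.50])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))    # mean absolute error
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in the unit of y
# r2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-computed values should agree with sklearn's implementations
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```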

#### Using LinearRegression (Ordinary Least Squares) and default parameters:

In [11]:
#get the MAE,MSE,RMSE, and R2 values to evaluate the model 
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score

MAE = mean_absolute_error(y_test,test_predictions1)
MSE = mean_squared_error(y_test,test_predictions1)
RMSE = np.sqrt(MSE)
r2 = r2_score(y_test,test_predictions1)

comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":test_predictions1})
display(comparison_df)

print(f"MAE = {MAE}\nMSE = {MSE}\nRMSE = {RMSE}\nr2 = {r2}")

Unnamed: 0,Actual,Predicted
0,238.36,237.250
1,40.00,49.250
2,37.50,35.750
3,42.77,24.250
4,138.00,205.500
...,...,...
27478,62.00,48.750
27479,102.75,142.000
27480,207.50,188.125
27481,36.74,28.250


MAE = 24.167676199832623
MSE = 1477.3576017856494
RMSE = 38.436409845166985
r2 = 0.8714324551306367


#### Using Ridge Regression (with built-in leave-one-out cross-validation) and alpha values in <i>np.logspace(1, 2, 50)</i>:

In [12]:
print(f"alpha = {regressor2.alpha_}") # alpha value chosen by cross validation

MAE = mean_absolute_error(y_test,test_predictions2)
MSE = mean_squared_error(y_test,test_predictions2)
RMSE = np.sqrt(MSE)
r2 = r2_score(y_test,test_predictions2)

comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":test_predictions2})
display(comparison_df)

print(f"MAE = {MAE}\nMSE = {MSE}\nRMSE = {RMSE}\nr2 = {r2}")

alpha = 10.0


Unnamed: 0,Actual,Predicted
0,238.36,236.569427
1,40.00,49.483868
2,37.50,36.596485
3,42.77,24.246054
4,138.00,204.944898
...,...,...
27478,62.00,51.878985
27479,102.75,141.803221
27480,207.50,187.428533
27481,36.74,29.573565


MAE = 24.17642370350396
MSE = 1476.4262153134666
RMSE = 38.42429199495375
r2 = 0.8715135093533303


#### OLS vs Ridge
Both models yield an <i>RMSE</i> of around 38.43 (pesos) and an <i>r<sup>2</sup></i> of around 0.8715, so on this particular train-test split they perform essentially identically. Ridge regression is generally expected to be the more stable of the two because of its regularization, but that advantage does not show up in these metrics.
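
One way to probe stability is to compare scores over several splits instead of a single one. The following is a minimal sketch on synthetic data generated with `make_regression`, not on the food price dataset; the names `X_demo`, `y_demo`, `models`, and `scores` are hypothetical, and the fold count is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the real features and prices
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=10.0, random_state=26)

cv = KFold(n_splits=5, shuffle=True, random_state=26)
models = {
    "OLS": LinearRegression(),
    "RidgeCV": RidgeCV(alphas=np.logspace(1, 2, 50)),
}

# r2 on each held-out fold; the spread across folds hints at stability
scores = {name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
          for name, model in models.items()}
for name, fold_scores in scores.items():
    print(f"{name}: mean r2 = {fold_scores.mean():.4f}, std = {fold_scores.std():.4f}")
```

A smaller standard deviation across folds would support a stability claim more directly than a single split can.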

### Extending the Linear Model using Polynomial Features

Ordinary linear regression assumes that the relationship between the predictors and the response is (1) additive, meaning the association between a predictor X<sub>i</sub> and the response Y is independent of the other predictors X<sub>j</sub>, and (2) linear, meaning a one-unit change in X always produces the same change in Y. These assumptions are its main weaknesses.

When we extend the linear model using polynomial features, we are essentially allowing the model to be capable of fitting non-linear relationships between the dependent and independent variables, as well as of considering the relationships between the multiple predictors in the model.

In this case, the linear model predicts the price of food from historical prices and factors like time and location, so we can extend it by adding polynomial terms (or <i>interaction</i> terms) that represent the relationships between the independent variables. This lets the model capture the fact that the price of a food item does not depend only on the sum of the individual effects of predictors like time and location, but can also increase or decrease with their combined effect (the predictors would now include time_x_location in addition to time and location individually).

Extending the linear model with polynomial features can be an effective way to improve performance, especially when the relationship between the dependent and independent variables is non-linear. However, adding too many polynomial terms can lead to overfitting, so the number of terms must be chosen carefully.
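
The effect of an interaction term can be illustrated with a toy example. In this sketch the target depends only on the product of two made-up predictors (`x1` and `x2` are hypothetical and unrelated to the dataset), so a purely additive linear model cannot fit it exactly while a model with the product column can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(26)
x1 = rng.uniform(0, 1, 500)
x2 = rng.uniform(0, 1, 500)
y = 10 * x1 * x2  # the response depends only on the combined effect

X_additive = np.column_stack([x1, x2])              # x1 and x2 individually
X_interact = np.column_stack([x1, x2, x1 * x2])     # plus the interaction term

# Training-set r2 for each feature set
r2_additive = LinearRegression().fit(X_additive, y).score(X_additive, y)
r2_interact = LinearRegression().fit(X_interact, y).score(X_interact, y)
print(f"additive r2 = {r2_additive:.3f}, with interaction r2 = {r2_interact:.3f}")
```

The additive model still explains a good share of the variance here, which is why the gain from interaction terms has to be checked on held-out data and not assumed.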

### Preprocessing

The added interaction features are as follows:
1. <b>month2</b> - adds a degree of freedom to the month variable. This will allow the month factor to be represented non-linearly (like a parabola) so that seasonal changes can be modelled more dynamically (e.g., mango prices decreasing mid-year but staying high during the first and last quarters).

2. <b>longlat</b> - represents the relationship between (and combined effect of) the latitude and longitude of a location, in addition to evaluating either, individually.

3. <b>inflyr</b> - represents the relationship between (and combined effect of) the year and inflation during that year.

4. <b>locyr</b> - represents the relationship between (and combined effect of) the year and location (using inflyr and longlat).

5. <b>locmth</b> - represents the relationship between (and combined effect of) the month and location (using month, month2, and longlat).

In [13]:
# Add polynomial features
data['month2'] = data['month']**2 # add degree of freedom
data['longlat'] = data['longitude']*data['latitude'] # long-lat relationship
data['inflyr'] = data['year']*data['inflation'] # inflation-year relationship
data['locyr'] = data['inflyr']*data['longlat'] # location-year relationship
data['locmth'] = data['month']*data['month2']*data['longlat'] # location-month relationship

#### Commodity-specific relationships
Some variables' effects on price depend heavily on the commodity in question. For example, the effect of <i>month</i> on price depends on the availability of the specified good at that time of year (think mango or strawberry seasons). Thus, it might be better to model each commodity's relationship with the other predictors individually, rather than generalize a predictor like month across all commodities.

This was done as follows:
1. For each commodity, add a new interaction feature with each of the selected non-commodity predictors in the dataset (year, inflyr, longlat, etc.).
2. Remove the commodity-dependent predictors that cannot be generalized to all commodities (month, month2, locmth, priceflag, pricetype).
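
The idea behind step 1 can be seen on a tiny hypothetical frame (`toy` is made up for illustration; only the commodity names come from the dataset). Multiplying a commodity dummy by a numeric predictor yields a column that is non-zero only for rows of that commodity, which gives the regression a separate slope per commodity.

```python
import pandas as pd

# Hypothetical rows: two commodities observed at two (already scaled) years
toy = pd.DataFrame({
    "commodity": ["Taro", "Tomatoes", "Taro", "Tomatoes"],
    "year": [0.0, 0.0, 1.0, 1.0],
})
toy = pd.get_dummies(toy, columns=["commodity"], dtype=int)

# One commodity-by-year interaction column per commodity dummy
for dummy in ["commodity_Taro", "commodity_Tomatoes"]:
    toy[dummy + "year"] = toy[dummy] * toy["year"]

print(toy)
```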

In [14]:
#divide the data into feature and target sets
X = data.drop(['price'], axis='columns')
y = data['price'].values

# normalize non-dummy columns using MinMaxScaler
# NOTE: the scaler is fit on the full dataset before the train-test split here,
# so the test rows influence the scaling ranges
num_cols = ['latitude','longitude','inflation','year','month','month2','longlat','inflyr','locyr','locmth']
X[num_cols] = sc.fit_transform(X[num_cols])

# For each commodity, add a new interaction feature with each of the selected non-commodity predictors
commodity_cols = [c for c in X.columns if c.startswith('commodity_')]
other_cols = ['year','inflyr','locyr','locmth','latitude','longitude','longlat','month','month2','priceflag_actual','priceflag_actual,aggregate','priceflag_aggregate','pricetype_Farm Gate','pricetype_Retail','pricetype_Wholesale']
# build all interaction columns first, then concatenate once (faster than concatenating inside the loop)
interactions = [(X[i]*X[j]).rename(i+j) for i in commodity_cols for j in other_cols]
X = pd.concat([X] + interactions, axis=1)

# remove commodity-dependent predictors
X = X.drop(['month','month2','locmth','priceflag_actual','priceflag_actual,aggregate','priceflag_aggregate','pricetype_Farm Gate','pricetype_Retail','pricetype_Wholesale'], axis='columns')

The train set after preprocessing is shown below. Although the feature set now has many columns, most of them are commodity-specific and are zero for rows of any other commodity, so the added complexity should not be drastic enough to risk overfitting.

In [15]:
# divide into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=26)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
display(X_train)

(109930, 1175)
(27483, 1175)
(109930,)
(27483,)


Unnamed: 0,latitude,longitude,inflation,year,commodity_Anchovies,commodity_Bananas (lakatan),commodity_Bananas (latundan),commodity_Bananas (saba),"commodity_Beans (green, fresh)",commodity_Beans (mung),...,commodity_Water spinachlongitude,commodity_Water spinachlonglat,commodity_Water spinachmonth,commodity_Water spinachmonth2,commodity_Water spinachpriceflag_actual,"commodity_Water spinachpriceflag_actual,aggregate",commodity_Water spinachpriceflag_aggregate,commodity_Water spinachpricetype_Farm Gate,commodity_Water spinachpricetype_Retail,commodity_Water spinachpricetype_Wholesale
15312,0.834606,0.200479,0.310935,0.521739,0,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0
51810,0.727313,0.300700,0.226614,0.869565,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0
34435,0.870263,0.322688,0.226614,0.869565,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0
35778,0.507230,0.486434,0.226614,0.869565,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0
49991,0.228774,0.677716,0.226614,0.869565,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73408,0.616623,0.671164,0.429513,0.913043,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0
137475,0.428277,0.563630,0.675889,0.956522,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0
77256,0.387619,0.816261,0.429513,0.913043,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0
59971,0.615886,0.669020,0.226614,0.869565,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,0


### Performing Linear Regression
The new model will use the same two regression methods as before.

#### Using LinearRegression (Ordinary Least Squares) and default parameters:

In [16]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

test_predictions = regressor.predict(X_test)

#get the MAE,MSE,RMSE, and R2 values to evaluate the model 
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)
r2 = r2_score(y_test,test_predictions)

comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":test_predictions})
display(comparison_df)

print(f"MAE = {MAE}\nMSE = {MSE}\nRMSE = {RMSE}\nr2 = {r2}")

Unnamed: 0,Actual,Predicted
0,238.36,214.796875
1,40.00,49.812500
2,37.50,37.109375
3,42.77,30.140625
4,138.00,160.171875
...,...,...
27478,62.00,56.304688
27479,102.75,166.921875
27480,207.50,199.917969
27481,36.74,31.203125


MAE = 17.38388128570389
MSE = 930.2825500686646
RMSE = 30.500533603015285
r2 = 0.9190418465017706


#### Using Ridge Regression:

In [17]:
regressor = RidgeCV(alphas=np.logspace(-2,2,50))
regressor.fit(X_train, y_train)
print(f"alpha = {regressor.alpha_}")

test_predictions = regressor.predict(X_test)

#get the MAE,MSE,RMSE, and R2 values to evaluate the model 
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)
r2 = r2_score(y_test,test_predictions)

comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":test_predictions})
display(comparison_df)

print(f"MAE = {MAE}\nMSE = {MSE}\nRMSE = {RMSE}\nr2 = {r2}")

alpha = 0.01


Unnamed: 0,Actual,Predicted
0,238.36,215.363156
1,40.00,50.784545
2,37.50,36.704489
3,42.77,31.606012
4,138.00,159.783664
...,...,...
27478,62.00,56.539222
27479,102.75,163.463086
27480,207.50,198.244000
27481,36.74,32.467403


MAE = 17.45419336766678
MSE = 947.4384096443913
RMSE = 30.780487482240943
r2 = 0.9175488520208752


## Results

### Summary of metrics

From the results above, it can be observed that the model with polynomial features <b>(r<sup>2</sup>≈0.92; RMSE≈30.50)</b> performs better than the previous model without polynomial features <b>(r<sup>2</sup>≈0.87; RMSE≈38.43)</b>. This is possibly because the model with polynomial features can represent non-linear relationships between the dependent and independent variables.

Between the two models with polynomial features, LinearRegression and Ridge regression achieve nearly the same <i>r<sup>2</sup></i> and <i>RMSE</i>, with LinearRegression slightly ahead on this split (RMSE≈30.50 vs. 30.78). Since Ridge regularizes the coefficients and is therefore generally less sensitive to multicollinearity, we consider it the more dependable food price predictor overall, although this single split does not demonstrate that advantage.



### Model Comparison
A 2022 study by Rao et al. used ARCH (autoregressive conditional heteroskedasticity) and GARCH (generalized autoregressive conditional heteroskedasticity) models to predict the prices of staple food materials in India. The study used a 4,392-row dataset from 2015-2018 that includes high price, thermal, and precipitation data. The models have an accuracy of more than 80%, with ARCH reaching up to 99.84% accuracy and GARCH reaching up to 96.57% accuracy.

Asnhari et al. (2019) used similar data, including precipitation and thermal data, to predict commodity prices in Indonesia. The study used linear regression and the Fourier model with ARIMA (Autoregressive Integrated Moving Average). The latter predicts the prices for all commodities with above 80% accuracy, and reportedly performs better on data with larger fluctuations. Multiple linear regression with ARIMA produced predictions with accuracy of up to 99.84%.

Both studies accounted for the correlation between weather and food prices by including thermal and precipitation data in their models. While our models do not include such data, we included location data to account for price variation across different places. In contrast to the time series models of these studies, which measure the volatility of prices over time, we used regression techniques, which reduce data complexity.

References:

Asnhari, S. F., Gunawan, P. H., & Rusmawati, Y. (2019). "Predicting Staple Food Materials Price Using Multivariables Factors (Regression and Fourier Models with ARIMA)," 2019 7th International Conference on Information and Communication Technology (ICoICT), Kuala Lumpur, Malaysia, pp. 1-5. https://doi.org/10.1109/ICoICT.2019.8835193

Rao, K. V., Srilatha, D., Jagan Mohan Reddy, D., Desanamukula, V. S., & Kejela, M. L. (2022). "Regression Based Price Prediction of Staple Food Materials Using Multivariate Models," Scientific Programming, vol. 2022, Article ID 4572064, 7 pages. https://doi.org/10.1155/2022/4572064