## 🍆 Vegetable Price Prediction

Given *data about vegetables sold in a market*, let's try to predict the **price** of a given vegetable.

We will use a variety of regression models to make our predictions.

Data source: https://www.kaggle.com/datasets/sudipsamanta35/vegetable-market

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv('archive/Vegetable_market.csv')
data

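|     | Vegetable | Season | Month | Temp | Deasaster Happen in last 3month | Vegetable condition | Price per kg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | potato | winter | jan | 15 | no | fresh | 20 |
| 1 | tomato | winter | jan | 15 | no | fresh | 50 |
| 2 | peas | winter | jan | 15 | no | fresh | 70 |
| 3 | pumkin | winter | jan | 15 | no | fresh | 25 |
| 4 | cucumber | winter | jan | 15 | no | fresh | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 116 | brinjal | winter | jan | 15 | yes | fresh | 33 |
| 117 | ginger | winter | jan | 15 | no | fresh | 88 |
| 118 | potato | summer | apr | 32 | no | fresh | 24 |
| 119 | peas | summer | apr | 33 | no | fresh | 33 |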


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 7 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   Vegetable                        121 non-null    object
 1   Season                           121 non-null    object
 2   Month                            121 non-null    object
 3   Temp                             121 non-null    int64 
 4   Deasaster Happen in last 3month  121 non-null    object
 5   Vegetable condition              121 non-null    object
 6   Price per kg                     121 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 6.7+ KB


### Preprocessing

In [4]:
df = data.copy()

In [5]:
{column: df[column].unique() for column in df.columns}

{'Vegetable': array(['potato', 'tomato ', 'peas', 'pumkin', 'cucumber',
        'pointed grourd ', 'Raddish', 'Bitter gourd', 'onion', 'garlic',
        'cabage', 'califlower', 'chilly', 'okra', 'brinjal', 'ginger',
        'radish'], dtype=object),
 'Season': array(['winter', 'summer', 'monsoon', 'autumn', 'spring'], dtype=object),
 'Month': array(['jan', 'apr', 'july', 'sept', 'oct', 'dec', 'may', 'aug', 'june',
        ' ', 'march'], dtype=object),
 'Temp': array([15, 32, 33, 35, 37, 30, 38, 28, 27, 18, 40, 29, 41, 31, 43, 21, 26]),
 'Deasaster Happen in last 3month': array(['no', 'yes'], dtype=object),
 'Vegetable condition': array(['fresh', 'scrap', 'avarage', 'scarp'], dtype=object),
 'Price per kg': array([ 20,  50,  70,  25, 130,  10,  35,  45, 150,  80,  30, 100,  60,
        170,  40, 200,  15, 250,  90,  16,  12,  28, 120,  75,  18, 190,
        210,  42,  55,  29,  32, 132,  21,  19,  22,  33,  24,  23,  53,
         27, 123,  88,   9])}

In [6]:
# Binary encoding: map 'no'/'yes' to 0/1
df['Deasaster Happen in last 3month'] = df['Deasaster Happen in last 3month'].replace({'no': 0, 'yes': 1})

In [7]:
df

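|     | Vegetable | Season | Month | Temp | Deasaster Happen in last 3month | Vegetable condition | Price per kg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | potato | winter | jan | 15 | 0 | fresh | 20 |
| 1 | tomato | winter | jan | 15 | 0 | fresh | 50 |
| 2 | peas | winter | jan | 15 | 0 | fresh | 70 |
| 3 | pumkin | winter | jan | 15 | 0 | fresh | 25 |
| 4 | cucumber | winter | jan | 15 | 0 | fresh | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 116 | brinjal | winter | jan | 15 | 1 | fresh | 33 |
| 117 | ginger | winter | jan | 15 | 0 | fresh | 88 |
| 118 | potato | summer | apr | 32 | 0 | fresh | 24 |
| 119 | peas | summer | apr | 33 | 0 | fresh | 33 |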


In [8]:
{column: df[column].unique() for column in df.select_dtypes('object').columns}

{'Vegetable': array(['potato', 'tomato ', 'peas', 'pumkin', 'cucumber',
        'pointed grourd ', 'Raddish', 'Bitter gourd', 'onion', 'garlic',
        'cabage', 'califlower', 'chilly', 'okra', 'brinjal', 'ginger',
        'radish'], dtype=object),
 'Season': array(['winter', 'summer', 'monsoon', 'autumn', 'spring'], dtype=object),
 'Month': array(['jan', 'apr', 'july', 'sept', 'oct', 'dec', 'may', 'aug', 'june',
        ' ', 'march'], dtype=object),
 'Vegetable condition': array(['fresh', 'scrap', 'avarage', 'scarp'], dtype=object)}

In [9]:
# Clean Vegetable condition column: merge the misspelled 'scarp' into 'scrap'
df['Vegetable condition'] = df['Vegetable condition'].replace({'scarp': 'scrap'})

In [10]:
{column: df[column].unique() for column in df.select_dtypes('object').columns}

{'Vegetable': array(['potato', 'tomato ', 'peas', 'pumkin', 'cucumber',
        'pointed grourd ', 'Raddish', 'Bitter gourd', 'onion', 'garlic',
        'cabage', 'califlower', 'chilly', 'okra', 'brinjal', 'ginger',
        'radish'], dtype=object),
 'Season': array(['winter', 'summer', 'monsoon', 'autumn', 'spring'], dtype=object),
 'Month': array(['jan', 'apr', 'july', 'sept', 'oct', 'dec', 'may', 'aug', 'june',
        ' ', 'march'], dtype=object),
 'Vegetable condition': array(['fresh', 'scrap', 'avarage'], dtype=object)}
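
The `Vegetable` column still carries inconsistent labels at this point: `'tomato '` and `'pointed grourd '` have trailing spaces, and `'Raddish'` and `'radish'` look like the same vegetable. The notebook leaves them unchanged, so the following is only a sketch of one possible cleanup. It assumes that `'Raddish'` and `'radish'` should be merged, and the results reported further down were produced without it. The small `veg` series is a hypothetical stand-in for `df['Vegetable']`.

```python
import pandas as pd

# Hypothetical stand-in for df['Vegetable'], using labels seen in the output above
veg = pd.Series(['potato', 'tomato ', 'pointed grourd ', 'Raddish', 'radish', 'Bitter gourd'])

# Strip surrounding whitespace and lowercase so that variants collapse to one label
veg_clean = veg.str.strip().str.lower()

# Assumption: 'raddish' is a misspelling of 'radish'
veg_clean = veg_clean.replace({'raddish': 'radish'})

print(veg_clean.unique())
```

Merging the two radish labels would leave one fewer one-hot column later on, so the scores below would likely change if this step were applied.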

In [11]:
# Ordinal encoding: map month names to 1-12; the blank entry ' ' becomes NaN
month_mapping = {
    'jan': 1,
    'feb': 2,
    'march': 3,
    'apr': 4,
    'may': 5,
    'june': 6,
    'july': 7,
    'aug': 8,
    'sept': 9,
    'oct': 10,
    'nov': 11,
    'dec': 12,
    ' ': np.nan
}

df['Month'] = df['Month'].replace(month_mapping)

In [12]:
{column: df[column].unique() for column in df.select_dtypes('object').columns}

{'Vegetable': array(['potato', 'tomato ', 'peas', 'pumkin', 'cucumber',
        'pointed grourd ', 'Raddish', 'Bitter gourd', 'onion', 'garlic',
        'cabage', 'califlower', 'chilly', 'okra', 'brinjal', 'ginger',
        'radish'], dtype=object),
 'Season': array(['winter', 'summer', 'monsoon', 'autumn', 'spring'], dtype=object),
 'Vegetable condition': array(['fresh', 'scrap', 'avarage'], dtype=object)}

In [13]:
df

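|     | Vegetable | Season | Month | Temp | Deasaster Happen in last 3month | Vegetable condition | Price per kg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | potato | winter | 1.0 | 15 | 0 | fresh | 20 |
| 1 | tomato | winter | 1.0 | 15 | 0 | fresh | 50 |
| 2 | peas | winter | 1.0 | 15 | 0 | fresh | 70 |
| 3 | pumkin | winter | 1.0 | 15 | 0 | fresh | 25 |
| 4 | cucumber | winter | 1.0 | 15 | 0 | fresh | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 116 | brinjal | winter | 1.0 | 15 | 1 | fresh | 33 |
| 117 | ginger | winter | 1.0 | 15 | 0 | fresh | 88 |
| 118 | potato | summer | 4.0 | 32 | 0 | fresh | 24 |
| 119 | peas | summer | 4.0 | 33 | 0 | fresh | 33 |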


In [14]:
df.isna().sum()

Vegetable                          0
Season                             0
Month                              3
Temp                               0
Deasaster Happen in last 3month    0
Vegetable condition                0
Price per kg                       0
dtype: int64

In [15]:
# Fill missing month values with column mode
df['Month'] = df['Month'].fillna(df['Month'].mode()[0])

In [16]:
df.isna().sum().sum()

np.int64(0)
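
The imputation above relies on `Series.mode()`, which ignores `NaN` by default and returns a Series rather than a scalar. When several values tie for most frequent, the result holds all of them in sorted order, so `[0]` picks the smallest. A minimal sketch with made-up month numbers (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up month numbers with two missing entries and a tie between 1.0 and 4.0
month = pd.Series([4.0, 1.0, np.nan, 4.0, 1.0, np.nan, 7.0])

modes = month.mode()  # NaN is ignored by default
print(modes.tolist())

# Fill the gaps with the first (smallest) mode
filled = month.fillna(modes[0])
print(filled.tolist())
```

With only three missing months out of 121 rows, the choice of fill value probably has little effect here, but the tie-breaking behavior is worth knowing.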

In [17]:
{column: df[column].unique() for column in df.select_dtypes('object').columns}

{'Vegetable': array(['potato', 'tomato ', 'peas', 'pumkin', 'cucumber',
        'pointed grourd ', 'Raddish', 'Bitter gourd', 'onion', 'garlic',
        'cabage', 'califlower', 'chilly', 'okra', 'brinjal', 'ginger',
        'radish'], dtype=object),
 'Season': array(['winter', 'summer', 'monsoon', 'autumn', 'spring'], dtype=object),
 'Vegetable condition': array(['fresh', 'scrap', 'avarage'], dtype=object)}

In [20]:
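def onehot_encode(df, column):
    """Return a copy of df with `column` replaced by 0/1 indicator columns."""
    df = df.copy()
    # One indicator column per unique value, named '<column>_<value>'
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df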

In [21]:
# One-hot encoding
for column in ['Vegetable', 'Season', 'Vegetable condition']:
    df = onehot_encode(df, column)

In [22]:
df

Unnamed: 0,Month,Temp,Deasaster Happen in last 3month,Price per kg,Vegetable_Bitter gourd,Vegetable_Raddish,Vegetable_brinjal,Vegetable_cabage,Vegetable_califlower,Vegetable_chilly,...,Vegetable_radish,Vegetable_tomato,Season_autumn,Season_monsoon,Season_spring,Season_summer,Season_winter,Vegetable condition_avarage,Vegetable condition_fresh,Vegetable condition_scrap
0,1.0,15,0,20,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
1,1.0,15,0,50,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
2,1.0,15,0,70,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
3,1.0,15,0,25,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,1.0,15,0,20,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,1.0,15,1,33,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
117,1.0,15,0,88,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
118,4.0,32,0,24,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
119,4.0,33,0,33,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


In [23]:
# Split df into X and y
y = df['Price per kg']
X = df.drop('Price per kg', axis=1)

In [24]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

In [25]:
X_train

Unnamed: 0,Month,Temp,Deasaster Happen in last 3month,Vegetable_Bitter gourd,Vegetable_Raddish,Vegetable_brinjal,Vegetable_cabage,Vegetable_califlower,Vegetable_chilly,Vegetable_cucumber,...,Vegetable_radish,Vegetable_tomato,Season_autumn,Season_monsoon,Season_spring,Season_summer,Season_winter,Vegetable condition_avarage,Vegetable condition_fresh,Vegetable condition_scrap
80,1.0,26,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
38,7.0,30,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
19,4.0,33,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
120,4.0,32,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,1,0
27,4.0,38,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9,1.0,15,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
72,12.0,15,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
12,1.0,15,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,0
107,12.0,21,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1


In [26]:
y_train

80      32
38     250
19     100
120      9
27      20
      ... 
9       45
72      10
12      20
107     32
37      40
Name: Price per kg, Length: 84, dtype: int64

In [27]:
# Scale X
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

In [28]:
X_train

Unnamed: 0,Month,Temp,Deasaster Happen in last 3month,Vegetable_Bitter gourd,Vegetable_Raddish,Vegetable_brinjal,Vegetable_cabage,Vegetable_califlower,Vegetable_chilly,Vegetable_cucumber,...,Vegetable_radish,Vegetable_tomato,Season_autumn,Season_monsoon,Season_spring,Season_summer,Season_winter,Vegetable condition_avarage,Vegetable condition_fresh,Vegetable condition_scrap
80,-0.807171,0.135584,-0.614043,-0.251577,-0.251577,-0.156174,-0.301511,-0.301511,-0.156174,-0.156174,...,-0.223607,-0.346410,-0.156174,-0.408248,-0.19245,-0.669534,1.024100,2.236068,-1.452966,-0.427900
38,1.067549,0.569452,1.628550,-0.251577,-0.251577,-0.156174,-0.301511,-0.301511,-0.156174,-0.156174,...,-0.223607,-0.346410,-0.156174,2.449490,-0.19245,-0.669534,-0.976467,-0.447214,0.688247,-0.427900
19,0.130189,0.894854,-0.614043,-0.251577,-0.251577,-0.156174,-0.301511,-0.301511,-0.156174,-0.156174,...,-0.223607,-0.346410,-0.156174,-0.408248,-0.19245,1.493576,-0.976467,-0.447214,0.688247,-0.427900
120,0.130189,0.786387,-0.614043,-0.251577,-0.251577,-0.156174,-0.301511,-0.301511,-0.156174,-0.156174,...,4.472136,-0.346410,-0.156174,-0.408248,-0.19245,1.493576,-0.976467,-0.447214,0.688247,-0.427900
27,0.130189,1.437189,-0.614043,-0.251577,-0.251577,-0.156174,-0.301511,-0.301511,-0.156174,-0.156174,...,-0.223607,-0.346410,-0.156174,-0.408248,-0.19245,1.493576,-0.976467,-0.447214,0.688247,-0.427900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9,-0.807171,-1.057554,-0.614043,-0.251577,-0.251577,-0.156174,-0.301511,-0.301511,-0.156174,-0.156174,...,-0.223607,2.886751,-0.156174,-0.408248,-0.19245,-0.669534,1.024100,-0.447214,0.688247,-0.427900
72,2.629816,-1.057554,-0.614043,-0.251577,-0.251577,-0.156174,-0.301511,-0.301511,-0.156174,-0.156174,...,-0.223607,-0.346410,-0.156174,-0.408248,-0.19245,-0.669534,1.024100,-0.447214,-1.452966,2.336993
12,-0.807171,-1.057554,-0.614043,-0.251577,-0.251577,-0.156174,-0.301511,3.316625,-0.156174,-0.156174,...,-0.223607,-0.346410,-0.156174,-0.408248,-0.19245,-0.669534,1.024100,-0.447214,0.688247,-0.427900
107,2.629816,-0.406752,-0.614043,3.974921,-0.251577,-0.156174,-0.301511,-0.301511,-0.156174,-0.156174,...,-0.223607,-0.346410,-0.156174,-0.408248,-0.19245,-0.669534,1.024100,-0.447214,-1.452966,2.336993


In [29]:
X_train.mean()

Month                             -2.907727e-17
Temp                               1.982541e-17
Deasaster Happen in last 3month    1.057355e-17
Vegetable_Bitter gourd             8.855350e-17
Vegetable_Raddish                  7.401487e-17
Vegetable_brinjal                 -2.907727e-17
Vegetable_cabage                  -3.172066e-17
Vegetable_califlower              -1.321694e-18
Vegetable_chilly                  -5.022437e-17
Vegetable_cucumber                -3.965082e-17
Vegetable_garlic                   3.766828e-17
Vegetable_ginger                  -5.286776e-17
Vegetable_okra                    -3.965082e-17
Vegetable_onion                   -9.516197e-17
Vegetable_peas                    -4.229421e-17
Vegetable_pointed grourd           6.344132e-17
Vegetable_potato                  -6.344132e-17
Vegetable_pumkin                   9.251859e-18
Vegetable_radish                  -5.352861e-17
Vegetable_tomato                   7.533656e-17
Season_autumn                     -6.079

In [30]:
X_train.var()

Month                              1.012048
Temp                               1.012048
Deasaster Happen in last 3month    1.012048
Vegetable_Bitter gourd             1.012048
Vegetable_Raddish                  1.012048
Vegetable_brinjal                  1.012048
Vegetable_cabage                   1.012048
Vegetable_califlower               1.012048
Vegetable_chilly                   1.012048
Vegetable_cucumber                 1.012048
Vegetable_garlic                   1.012048
Vegetable_ginger                   1.012048
Vegetable_okra                     1.012048
Vegetable_onion                    1.012048
Vegetable_peas                     1.012048
Vegetable_pointed grourd           1.012048
Vegetable_potato                   1.012048
Vegetable_pumkin                   1.012048
Vegetable_radish                   1.012048
Vegetable_tomato                   1.012048
Season_autumn                      1.012048
Season_monsoon                     1.012048
Season_spring                   
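
The variances above are 1.012048 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.var()` defaults to the sample variance (`ddof=1`). With 84 training rows the ratio is 84 / 83 ≈ 1.012048. A minimal sketch on synthetic data (the values are random, not from the dataset) that reproduces the effect without scikit-learn:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 84  # same number of rows as the training split above
x = pd.Series(rng.normal(loc=30, scale=8, size=n))

# Standardize with the population standard deviation (ddof=0), as StandardScaler does
z = (x - x.mean()) / x.std(ddof=0)

print(z.var())        # pandas default, ddof=1
print(n / (n - 1))    # 84 / 83
print(z.var(ddof=0))  # population variance of the standardized values
```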

### Training

In [31]:
models = {
    "                     Linear Regression": LinearRegression(),
    " Linear Regression (L2 Regularization)": Ridge(),
    " Linear Regression (L1 Regularization)": Lasso(),
    "                   K-Nearest Neighbors": KNeighborsRegressor(),
    "                        Neural Network": MLPRegressor(),
    "Support Vector Machine (Linear Kernel)": LinearSVR(),
    "   Support Vector Machine (RBF Kernel)": SVR(),
    "                         Decision Tree": DecisionTreeRegressor(),
    "                         Random Forest": RandomForestRegressor(),
    "                     Gradient Boosting": GradientBoostingRegressor(),
    "                               XGBoost": XGBRegressor(),
    "                              LightGBM": LGBMRegressor(),
    "                              CatBoost": CatBoostRegressor(verbose=0)
}

In [33]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " Model trained.")

                     Linear Regression Model trained.
 Linear Regression (L2 Regularization) Model trained.
 Linear Regression (L1 Regularization) Model trained.
                   K-Nearest Neighbors Model trained.
                        Neural Network Model trained.
Support Vector Machine (Linear Kernel) Model trained.
   Support Vector Machine (RBF Kernel) Model trained.
                         Decision Tree Model trained.
                         Random Forest Model trained.
                     Gradient Boosting Model trained.
                               XGBoost Model trained.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000027 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 30
[LightGBM] [Info] Number of data points in the train set: 84, number of used features: 6
[LightGBM] [Info] Start training from score 55.333333
     

### Results

In [35]:
for name, model in models.items():
    print(name + " R^2 Score: {:.5f}".format(model.score(X_test, y_test)))

                     Linear Regression R^2 Score: 0.71175
 Linear Regression (L2 Regularization) R^2 Score: 0.71197
 Linear Regression (L1 Regularization) R^2 Score: 0.70461
                   K-Nearest Neighbors R^2 Score: 0.23805
                        Neural Network R^2 Score: -0.19528
Support Vector Machine (Linear Kernel) R^2 Score: 0.41874
   Support Vector Machine (RBF Kernel) R^2 Score: -0.12694
                         Decision Tree R^2 Score: 0.60808
                         Random Forest R^2 Score: 0.61513
                     Gradient Boosting R^2 Score: 0.62848
                               XGBoost R^2 Score: 0.63068
                              LightGBM R^2 Score: 0.15834
                              CatBoost R^2 Score: 0.62010
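
For scikit-learn regressors, `model.score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. A score below zero, as for the neural network and the RBF-kernel SVR above, means the model's squared error on the test set is larger than that of a constant prediction equal to the test-set mean. A minimal sketch with made-up numbers (not from the dataset); `r2` is a hypothetical helper written for illustration:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [20, 50, 70, 25]              # made-up prices
mean_baseline = [np.mean(y_true)] * 4  # always predict the mean
poor_model = [70, 20, 25, 50]          # predictions worse than the mean baseline

print(r2(y_true, mean_baseline))  # baseline scores exactly zero
print(r2(y_true, poor_model))     # negative score
print(r2(y_true, y_true))         # perfect predictions score one
```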


In [36]:
len(y_test)

37
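
With only 37 test rows, a single train-test split gives a noisy estimate, and the ranking of the models above could shift with a different `random_state`. One common alternative is k-fold cross-validation. The sketch below is not part of the original notebook: it uses synthetic data from `make_regression` as a stand-in for `X` and `y`, and `Ridge` as one example model, so the printed scores say nothing about the vegetable dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of rows as the dataset (121)
X, y = make_regression(n_samples=121, n_features=10, noise=10.0, random_state=1)

# Scaling sits inside the pipeline so it is refit on each training fold
pipeline = make_pipeline(StandardScaler(), Ridge())

# Five shuffled folds; each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='r2')

print(scores)
print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds makes it easier to judge whether differences such as 0.71197 versus 0.71175 are meaningful.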