# Food Price Data Source

[WFP Food Prices Kenya Dataset](https://data.humdata.org/dataset/e0d3fba6-f9a2-45d7-b949-140c455197ff/resource/517ee1bf-2437-4f8c-aa1b-cb9925b9d437/download/wfp_food_prices_ken.csv)

# DATA CLEANING

In [None]:
%pip install pyspark



In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('FoodPricePrediction').master('local[*]').getOrCreate()

spark.sparkContext.appName

25/04/02 11:06:04 WARN Utils: Your hostname, codespaces-ebd91c resolves to a loopback address: 127.0.0.1; using 10.0.1.231 instead (on interface eth0)
25/04/02 11:06:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/02 11:06:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/04/02 11:06:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/04/02 11:06:06 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


'FoodPricePrediction'

25/04/02 11:06:23 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


In [8]:
data = spark.read.csv("drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets/data/wfp_food_prices_ken.csv",inferSchema=True,header=True)
data.printSchema()

root
 |-- date: string (nullable = true)
 |-- admin1: string (nullable = true)
 |-- admin2: string (nullable = true)
 |-- market: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- category: string (nullable = true)
 |-- commodity: string (nullable = true)
 |-- unit: string (nullable = true)
 |-- priceflag: string (nullable = true)
 |-- pricetype: string (nullable = true)
 |-- currency: string (nullable = true)
 |-- price: string (nullable = true)
 |-- usdprice: string (nullable = true)



In [9]:
data.show(5)

+----------+----------+----------+----------------+--------+---------+------------------+-------------+----------+----------------+----------------+---------+------+----------+
|      date|    admin1|    admin2|          market|latitude|longitude|          category|    commodity|      unit|       priceflag|       pricetype| currency| price|  usdprice|
+----------+----------+----------+----------------+--------+---------+------------------+-------------+----------+----------------+----------------+---------+------+----------+
|     #date|#adm1+name|#adm2+name|#loc+market+name|#geo+lat| #geo+lon|        #item+type|   #item+name|#item+unit|#item+price+flag|#item+price+type|#currency|#value|#value+usd|
|2006-01-15|     Coast|   Mombasa|         Mombasa|   -4.05|39.666667|cereals and tubers|        Maize|        KG|          actual|       Wholesale|      KES| 16.13|    0.2235|
|2006-01-15|     Coast|   Mombasa|         Mombasa|   -4.05|39.666667|cereals and tubers|Maize (white)|     90 KG| 
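
The row of `#date`, `#adm1+name`, `#geo+lat` values in the output above is the HXL hashtag row that sits directly under the CSV header. It is metadata rather than an observation, and its presence is the likely reason `inferSchema=True` reports every column as `string`. A minimal sketch of filtering it out before any casting follows; `is_hxl_tag` and `drop_hxl_row` are hypothetical helper names that do not appear in the notebook.

```python
def is_hxl_tag(value):
    """Return True when a cell looks like an HXL hashtag such as '#date' or '#geo+lat'."""
    return isinstance(value, str) and value.strip().startswith("#")


def drop_hxl_row(df, column="date"):
    """Drop rows whose `column` starts with '#'. Assumes `df` is a PySpark DataFrame
    in which `column` is still a string, i.e. before any cast to date."""
    # Imported here so that is_hxl_tag stays usable without a Spark installation
    from pyspark.sql.functions import col

    return df.filter(~col(column).startswith("#"))


# Plain-Python check against values visible in the output above
print(is_hxl_tag("#date"))       # True
print(is_hxl_tag("2006-01-15"))  # False
```

In the notebook this could be called as `data = drop_hxl_row(data)` right after `spark.read.csv`. Rows with a null in `column` are removed by the same filter, because a null comparison never evaluates to true.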

In [None]:
data1 = data.withColumnRenamed("admin1", "region").withColumnRenamed("admin2", "county")

In [11]:
data1.show(2)

+----------+----------+----------+----------------+--------+---------+------------------+----------+----------+----------------+----------------+---------+------+----------+
|      date|    region|    county|          market|latitude|longitude|          category| commodity|      unit|       priceflag|       pricetype| currency| price|  usdprice|
+----------+----------+----------+----------------+--------+---------+------------------+----------+----------+----------------+----------------+---------+------+----------+
|     #date|#adm1+name|#adm2+name|#loc+market+name|#geo+lat| #geo+lon|        #item+type|#item+name|#item+unit|#item+price+flag|#item+price+type|#currency|#value|#value+usd|
|2006-01-15|     Coast|   Mombasa|         Mombasa|   -4.05|39.666667|cereals and tubers|     Maize|        KG|          actual|       Wholesale|      KES| 16.13|    0.2235|
+----------+----------+----------+----------------+--------+---------+------------------+----------+----------+----------------+--

In [None]:
from pyspark.sql.functions import col, to_date
from pyspark.sql.types import FloatType, DoubleType


data2 = data1.withColumn("date", to_date(col("date"), "yyyy-MM-dd")) \
             .withColumn("latitude", col("latitude").cast(DoubleType())) \
             .withColumn("longitude", col("longitude").cast(DoubleType())) \
             .withColumn("price", col("price").cast(FloatType())) \
             .withColumn("usdprice", col("usdprice").cast(FloatType()))

In [13]:
data2.printSchema()

root
 |-- date: date (nullable = true)
 |-- region: string (nullable = true)
 |-- county: string (nullable = true)
 |-- market: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- category: string (nullable = true)
 |-- commodity: string (nullable = true)
 |-- unit: string (nullable = true)
 |-- priceflag: string (nullable = true)
 |-- pricetype: string (nullable = true)
 |-- currency: string (nullable = true)
 |-- price: float (nullable = true)
 |-- usdprice: float (nullable = true)



In [None]:
data2.count(), len(data2.columns)

(12865, 14)

In [None]:
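from pyspark.sql.functions import col, sum as spark_sum  # alias avoids shadowing Python's built-in sum


# Count nulls per column: isNull() yields a boolean, which is cast to int and summed
null_counts = data2.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in data2.columns])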
null_counts.show()

+----+------+------+------+--------+---------+--------+---------+----+---------+---------+--------+-----+--------+
|date|region|county|market|latitude|longitude|category|commodity|unit|priceflag|pricetype|currency|price|usdprice|
+----+------+------+------+--------+---------+--------+---------+----+---------+---------+--------+-----+--------+
|   1|    40|    40|     0|      41|       41|       0|        0|   0|        0|        0|       0|    1|       1|
+----+------+------+------+--------+---------+--------+---------+----+---------+---------+--------+-----+--------+
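
The counts above show at most 41 nulls in any column (`latitude` and `longitude`). The single null in `date`, `price`, and `usdprice` is consistent with the HXL hashtag row failing its casts. A hedged sketch of restricting the drop to the columns that the modelling cells read is given here; `MODEL_COLUMNS` and `drop_incomplete_rows` are hypothetical names introduced for illustration.

```python
# Columns used as features or label in the modelling cells (an assumption drawn
# from feature_cols, the assembler inputs, and labelCol='usdprice')
MODEL_COLUMNS = [
    "region", "county", "market", "category", "commodity", "unit",
    "latitude", "longitude", "usdprice",
]


def drop_incomplete_rows(df, columns=MODEL_COLUMNS):
    """Drop rows with a null in any of `columns`; `df` is assumed to be a PySpark DataFrame."""
    return df.dropna(how="any", subset=list(columns))
```

Calling `data_clean = drop_incomplete_rows(data2)` would keep a row whose only null sits in a column the model never reads, whereas a bare `dropna()` discards it.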



In [None]:
data_clean = data2.dropna()

In [None]:
data_clean.count(), len(data_clean.columns)

(12824, 14)

In [None]:
output_path = "drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets/data/cleaned_data.csv"
data_clean.write.csv(output_path, header=True)

# MODELLING

In [1]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor, LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml import Pipeline

In [5]:
path = "data/cleaned_data.csv"
food_data = spark.read.csv(path, inferSchema=True, header=True)

                                                                                

In [6]:
feature_cols = ['region', 'county', 'market', 'category', 'commodity', 'unit']

In [7]:
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_index", handleInvalid='keep') for c in feature_cols]

In [8]:
assembler = VectorAssembler(inputCols=[f"{c}_index" for c in feature_cols] + ['latitude', 'longitude'],
                            outputCol='features', handleInvalid='skip')

In [9]:
models = {
    'RandomForest': RandomForestRegressor(featuresCol='features', labelCol='usdprice', maxBins=100, numTrees=50),
    'GradientBoostedTree': GBTRegressor(featuresCol='features', labelCol='usdprice', maxBins=100),
    'LinearRegression': LinearRegression(featuresCol='features', labelCol='usdprice')
}

In [10]:
train_data, test_data = food_data.randomSplit([0.8, 0.2], seed=42)

In [11]:
best_model = None
best_rmse = float('inf')
best_name = ""

# Share one evaluator between cross-validation and the held-out test score
evaluator = RegressionEvaluator(labelCol='usdprice', predictionCol='prediction', metricName='rmse')

# Evaluate models
for name, model in models.items():
    pipeline = Pipeline(stages=indexers + [assembler, model])

    # An empty grid yields a single parameter map, so cross-validation only
    # scores the default hyperparameters; no tuning takes place
    paramGrid = ParamGridBuilder().build()
    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=paramGrid,
                              evaluator=evaluator,
                              numFolds=5)

    try:
        cv_model = crossval.fit(train_data)
        predictions = cv_model.transform(test_data)
        rmse = evaluator.evaluate(predictions)

        print(f"{name} RMSE: {rmse}")

        if rmse < best_rmse:
            best_rmse = rmse
            best_model = cv_model.bestModel
            best_name = name

    except Exception as e:
        print(f"Failed to train {name} model: {e}")

print(f"Best model: {best_name} with RMSE: {best_rmse}")

                                                                                

RandomForest RMSE: 8.132408816549402


25/04/02 11:07:56 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
25/04/02 11:07:56 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/socket.py", line 707, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt


KeyboardInterrupt: 
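
The comparison above ranks models by root mean squared error, which is expressed in the units of `usdprice`. As a reminder of what that number measures, here is a small plain-Python sketch of the formula; the function name `rmse` and the toy values are illustrative assumptions, not notebook data.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((label - prediction) ** 2))."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values: errors of 3 and 4 give sqrt((9 + 16) / 2)
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Because each error is squared before averaging, a handful of large misses can dominate the score.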

In [None]:
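# cv_model is whichever pipeline the loop fitted last, not necessarily the GBT one;
# best_model is the lowest-RMSE pipeline tracked by the comparison loop
best_gbt_model = best_model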


model_path = "drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets/models/gbt_price_prediction_model"
best_gbt_model.save(model_path)

print(f"Model saved to {model_path}")

Model saved to drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets/models/gbt_price_prediction_model


In [13]:
from pyspark.sql import Row

In [14]:
sample_data = [
    Row(region="Coast", county="Mombasa", market="Mombasa", category="cereals and tubers", commodity="Maize", unit="KG", latitude=-4.05, longitude=39.666667)
]

In [15]:
sample_df = spark.createDataFrame(sample_data)

In [16]:
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_index", handleInvalid='keep') for c in ['region', 'county', 'market', 'category', 'commodity', 'unit']]

In [17]:
assembler = VectorAssembler(inputCols=[f"{c}_index" for c in ['region', 'county', 'market', 'category', 'commodity', 'unit']] + ['latitude', 'longitude'],
                            outputCol='features', handleInvalid='skip')

In [18]:
pipeline = Pipeline(stages=indexers + [assembler])

In [19]:
pipeline_model = pipeline.fit(sample_df)
transformed_sample_df = pipeline_model.transform(sample_df)

                                                                                

In [20]:
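# best_gbt_model is a fitted PipelineModel that already contains the StringIndexer and
# VectorAssembler stages learned on the training data, so it takes the raw sample rows.
# Indexers re-fitted on a one-row sample would map every category to index 0, and the
# pipeline would reject input that already carries the *_index and features columns.
predictions = best_gbt_model.transform(sample_df)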

NameError: name 'best_gbt_model' is not defined

In [None]:
predictions.select('prediction').show()
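
The `NameError` above indicates that `best_gbt_model` was not defined in the running session. One way around this, sketched under the assumption that the pipeline saved earlier still exists at `model_path`, is to reload it with `PipelineModel.load` and hand it the raw sample columns. `REQUIRED_COLUMNS`, `missing_columns`, and `load_and_predict` are hypothetical names.

```python
# Raw input columns the saved pipeline expects (taken from feature_cols plus the
# two numeric assembler inputs in the modelling cells)
REQUIRED_COLUMNS = [
    "region", "county", "market", "category", "commodity", "unit",
    "latitude", "longitude",
]


def missing_columns(row):
    """Return the required columns that are absent from a dict-like sample row."""
    return [name for name in REQUIRED_COLUMNS if name not in row]


def load_and_predict(spark, model_path, rows):
    """Reload a saved PipelineModel and score raw rows (a list of dicts)."""
    # Imported lazily so missing_columns can be used without Spark installed
    from pyspark.ml import PipelineModel
    from pyspark.sql import Row

    for row in rows:
        absent = missing_columns(row)
        if absent:
            raise ValueError(f"sample row is missing columns: {absent}")

    model = PipelineModel.load(model_path)
    sample_df = spark.createDataFrame([Row(**row) for row in rows])
    # The reloaded pipeline applies its own fitted indexers and assembler
    return model.transform(sample_df).select("prediction")


sample = {
    "region": "Coast", "county": "Mombasa", "market": "Mombasa",
    "category": "cereals and tubers", "commodity": "Maize", "unit": "KG",
    "latitude": -4.05, "longitude": 39.666667,
}
print(missing_columns(sample))  # []
```

With an active session, `load_and_predict(spark, model_path, [sample]).show()` would then print the predicted `usdprice` for that row.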

# Git Version Control Setup

# brc0d3s (dev Branch)

In [49]:
%cd /content/drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets

/content/drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets


In [50]:
!git pull origin main

remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (1/1), 889 bytes | 88.00 KiB/s, done.
From https://github.com/brc0d3s/Distributed-Food-Price-Prediction-for-Kenyan-Markets
 * branch            main       -> FETCH_HEAD
   900cd52..e00d7eb  main       -> origin/main
hint: You have divergent branches and need to specify how to reconcile them.
hint: You can do so by running one of the following commands sometime before
hint: your next pull:
hint:
hint:   git config pull.rebase false  # merge (the default strategy)
hint:   git config pull.rebase true   # rebase
hint:   git config pull.ff only       # fast-forward only
hint:
hint: You can replace "git config" with "git config --global" to set a default
hint: pre

In [56]:
!git add .

In [57]:
!git config --global user.email "brc0d3s@gmail.com"
!git config --global user.name "brc0d3s"

In [58]:
!git commit -m "model"

[dev 114d1e6] model
 65 files changed, 10 insertions(+), 1 deletion(-)
 create mode 100644 models/gbt_price_prediction_model/metadata/._SUCCESS.crc
 create mode 100644 models/gbt_price_prediction_model/metadata/.part-00000.crc
 create mode 100644 models/gbt_price_prediction_model/metadata/_SUCCESS
 create mode 100644 models/gbt_price_prediction_model/metadata/part-00000
 create mode 100644 models/gbt_price_prediction_model/stages/0_StringIndexer_458939cc39e8/data/._SUCCESS.crc
 create mode 100644 models/gbt_price_prediction_model/stages/0_StringIndexer_458939cc39e8/data/.part-00000-0c4ae22f-c12a-475c-8067-c26b2ffbb2e4-c000.snappy.parquet.crc
 create mode 100644 models/gbt_price_prediction_model/stages/0_StringIndexer_458939cc39e8/data/_SUCCESS
 create mode 100644 models/gbt_price_prediction_model/stages/0_StringIndexer_458939cc39e8/data/part-00000-0c4ae22f-c12a-475c-8067-c26b2ffbb2e4-c000.snappy.parquet
 create mode 100644 models/gbt_price_prediction_model/stages/0_StringIndexer_458939

In [59]:
!git push origin dev

Enumerating objects: 66, done.
Counting objects: 100% (66/66), done.
Delta compression using up to 2 threads
Compressing objects: 100% (45/45), done.
Writing objects: 100% (64/64), 10.17 KiB | 63.00 KiB/s, done.
Total 64 (delta 17), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (17/17), completed with 2 local objects.
To https://github.com/brc0d3s/Distributed-Food-Price-Prediction-for-Kenyan-Markets.git
   fa5272c..114d1e6  dev -> dev


# barth123 (barth Branch)

In [None]:
%cd /content/drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets

/content/drive/MyDrive


In [None]:
!git pull


In [None]:
!git add .

In [None]:
!git commit -m "Data Cleaning"

[dev 3056fb5] Data Cleaning
 3 files changed, 1 insertion(+), 1 deletion(-)
 create mode 100644 Abstract/ABSTRACT_GROUP20.docx
 create mode 100644 Abstract/ABSTRACT_GROUP20.pdf


In [None]:
!git push origin dev

Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 2 threads
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 114.97 KiB | 3.48 MiB/s, done.
Total 6 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100%