# Food Price Data Source

[WFP Food Prices Kenya Dataset](https://data.humdata.org/dataset/wfp-food-prices-for-kenya)

# Rainfall Data Source

[WFP Rainfall Kenya Dataset](https://data.humdata.org/dataset/ken-rainfall-subnational)

In [102]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# DATA CLEANING

In [103]:
!pip install pyspark



In [104]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('FoodPricePrediction').master('local[*]').getOrCreate()

spark.sparkContext.appName

'FoodPricePrediction'

In [105]:
data = spark.read.csv(
    "drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets/data/wfp_food_prices_ken_data.csv",
    inferSchema=True,
    header=True,
)
data.printSchema()

root
 |-- date: string (nullable = true)
 |-- region: string (nullable = true)
 |-- county: string (nullable = true)
 |-- market: string (nullable = true)
 |-- category: string (nullable = true)
 |-- commodity: string (nullable = true)
 |-- unit: string (nullable = true)
 |-- pricetype: string (nullable = true)
 |-- price: double (nullable = true)



In [106]:
data.show(5)

+---------+-------+--------+--------+------------------+-------------+-----+---------+------+
|     date| region|  county|  market|          category|    commodity| unit|pricetype| price|
+---------+-------+--------+--------+------------------+-------------+-----+---------+------+
|1/15/2006|  Coast| Mombasa| Mombasa|cereals and tubers|        Maize|   KG|Wholesale| 16.13|
|1/15/2006|Eastern|   Kitui|   Kitui|cereals and tubers|      Sorghum|90 KG|Wholesale|1800.0|
|1/15/2006|Eastern|   Kitui|   Kitui|   pulses and nuts|  Beans (dry)|   KG|   Retail|  39.0|
|1/15/2006|Eastern|Marsabit|Marsabit|cereals and tubers|Maize (white)|   KG|   Retail|  21.0|
|1/15/2006|Nairobi| Nairobi| Nairobi|cereals and tubers|        Bread|400 G|   Retail|  26.0|
+---------+-------+--------+--------+------------------+-------------+-----+---------+------+
only showing top 5 rows



In [107]:
rainfall = spark.read.csv(
    "drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets/data/ken-rainfall-data.csv",
    inferSchema=True,
    header=True,
)
rainfall.printSchema()

root
 |-- date: string (nullable = true)
 |-- rainfall_mm: double (nullable = true)



In [108]:
rainfall.show(5)

+---------+-----------+
|     date|rainfall_mm|
+---------+-----------+
| 1/1/1981|       NULL|
|1/11/1981|       NULL|
|1/21/1981|       NULL|
| 2/1/1981|       NULL|
|2/11/1981|       NULL|
+---------+-----------+
only showing top 5 rows



In [109]:
from pyspark.sql.functions import to_date

# Convert date strings to DateType
data = data.withColumn("date", to_date("date", "M/d/yyyy"))
rainfall = rainfall.withColumn("date", to_date("date", "M/d/yyyy"))


In [110]:
from pyspark.sql import functions as F

# Split the 'date' column into 'month' and 'year'
data1 = data.withColumn('month', F.month('date')) \
                       .withColumn('year', F.year('date'))
data1 = data1.drop('date')

data1.show(5)


+-------+--------+--------+------------------+-------------+-----+---------+------+-----+----+
| region|  county|  market|          category|    commodity| unit|pricetype| price|month|year|
+-------+--------+--------+------------------+-------------+-----+---------+------+-----+----+
|  Coast| Mombasa| Mombasa|cereals and tubers|        Maize|   KG|Wholesale| 16.13|    1|2006|
|Eastern|   Kitui|   Kitui|cereals and tubers|      Sorghum|90 KG|Wholesale|1800.0|    1|2006|
|Eastern|   Kitui|   Kitui|   pulses and nuts|  Beans (dry)|   KG|   Retail|  39.0|    1|2006|
|Eastern|Marsabit|Marsabit|cereals and tubers|Maize (white)|   KG|   Retail|  21.0|    1|2006|
|Nairobi| Nairobi| Nairobi|cereals and tubers|        Bread|400 G|   Retail|  26.0|    1|2006|
+-------+--------+--------+------------------+-------------+-----+---------+------+-----+----+
only showing top 5 rows



In [111]:
from pyspark.sql.functions import col, sum as spark_sum  # alias avoids shadowing Python's built-in sum

# Count nulls in each column for 'data' DataFrame
null_counts = data.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in data.columns])
null_counts.show()

# Count nulls in each column for 'rainfall' DataFrame
rainfall_null_counts = rainfall.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in rainfall.columns])
rainfall_null_counts.show()

+----+------+------+------+--------+---------+----+---------+-----+
|date|region|county|market|category|commodity|unit|pricetype|price|
+----+------+------+------+--------+---------+----+---------+-----+
|   0|    44|    44|     0|       0|        0|   0|        0|    0|
+----+------+------+------+--------+---------+----+---------+-----+

+----+-----------+
|date|rainfall_mm|
+----+-----------+
|   0|        584|
+----+-----------+



In [112]:
data_clean = data.dropna()
rainfall_clean = rainfall.dropna()

In [113]:
data_clean.count(), len(data.columns)

(12702, 9)

In [114]:
rainfall_clean.count(), len(rainfall.columns)

(115705, 2)

In [115]:
data_clean.show(5)

+----------+-------+--------+--------+------------------+-------------+-----+---------+------+
|      date| region|  county|  market|          category|    commodity| unit|pricetype| price|
+----------+-------+--------+--------+------------------+-------------+-----+---------+------+
|2006-01-15|  Coast| Mombasa| Mombasa|cereals and tubers|        Maize|   KG|Wholesale| 16.13|
|2006-01-15|Eastern|   Kitui|   Kitui|cereals and tubers|      Sorghum|90 KG|Wholesale|1800.0|
|2006-01-15|Eastern|   Kitui|   Kitui|   pulses and nuts|  Beans (dry)|   KG|   Retail|  39.0|
|2006-01-15|Eastern|Marsabit|Marsabit|cereals and tubers|Maize (white)|   KG|   Retail|  21.0|
|2006-01-15|Nairobi| Nairobi| Nairobi|cereals and tubers|        Bread|400 G|   Retail|  26.0|
+----------+-------+--------+--------+------------------+-------------+-----+---------+------+
only showing top 5 rows



In [116]:
rainfall_clean.show(5)

+----------+-----------+
|      date|rainfall_mm|
+----------+-----------+
|1981-03-21|   266.3542|
|1981-04-01|     360.75|
|1981-04-11|      542.5|
|1981-04-21|   608.1042|
|1981-05-01|   767.2083|
+----------+-----------+
only showing top 5 rows



In [117]:
from pyspark.sql import functions as F

# Split the 'date' column into 'month' and 'year'
data_clean = data_clean.withColumn('month', F.month('date')) \
                       .withColumn('year', F.year('date'))

# Drop the 'date' column
data_clean = data_clean.drop('date')

# Filter years between 2014 and 2024 (inclusive)
data_clean = data_clean.filter((data_clean.year >= 2014) & (data_clean.year <= 2024))

# Sort by year and month
data_clean = data_clean.orderBy("year", "month")

data_clean.show(5)

+-------+--------+--------+------------------+-------------+-----+---------+------+-----+----+
| region|  county|  market|          category|    commodity| unit|pricetype| price|month|year|
+-------+--------+--------+------------------+-------------+-----+---------+------+-----+----+
|  Coast| Mombasa| Mombasa|cereals and tubers|        Maize|   KG|Wholesale| 38.44|    1|2014|
|  Coast| Mombasa| Mombasa|   pulses and nuts|        Beans|   KG|Wholesale| 79.99|    1|2014|
|  Coast| Mombasa| Mombasa|   pulses and nuts|  Beans (dry)|90 KG|Wholesale|5738.0|    1|2014|
|Eastern|   Kitui|   Kitui|   pulses and nuts|  Beans (dry)|   KG|   Retail|  74.0|    1|2014|
|Eastern|Marsabit|Marsabit|cereals and tubers|Maize (white)|   KG|   Retail| 53.36|    1|2014|
+-------+--------+--------+------------------+-------------+-----+---------+------+-----+----+
only showing top 5 rows



In [118]:
from pyspark.sql import functions as F
from pyspark.sql.functions import avg, round as spark_round  # alias avoids shadowing Python's built-in round

# Derive 'month' and 'year' columns from 'date'
rainfall_clean = rainfall_clean.withColumn('month', F.month('date')) \
                               .withColumn('year', F.year('date'))

# Drop the 'date' column
rainfall_clean = rainfall_clean.drop('date')

# Group by year and month, and calculate average rainfall rounded to 2 decimal places
rainfall_clean = rainfall_clean.groupBy("year", "month").agg(
    spark_round(avg("rainfall_mm"), 2).alias("avg_rainfall_mm")
)

# Filter years between 2014 and 2024 (inclusive)
rainfall_clean = rainfall_clean.filter((rainfall_clean.year >= 2014) & (rainfall_clean.year <= 2024))

# Sort by year and month
rainfall_clean = rainfall_clean.orderBy("year", "month")

# Show first 5 rows
rainfall_clean.show(5)

+----+-----+---------------+
|year|month|avg_rainfall_mm|
+----+-----+---------------+
|2014|    1|         259.33|
|2014|    2|         201.62|
|2014|    3|         184.92|
|2014|    4|         269.88|
|2014|    5|         339.76|
+----+-----+---------------+
only showing top 5 rows
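The group-and-average step above can be illustrated in plain Python. This is a minimal sketch that uses only the five rows displayed by the earlier `rainfall_clean.show(5)` call, so the resulting averages describe those five rows and not the full dataset, which may contain more readings per month.

```python
from collections import defaultdict
from datetime import date

# Five rows copied from the rainfall_clean.show(5) output earlier in the notebook.
rows = [
    (date(1981, 3, 21), 266.3542),
    (date(1981, 4, 1), 360.75),
    (date(1981, 4, 11), 542.5),
    (date(1981, 4, 21), 608.1042),
    (date(1981, 5, 1), 767.2083),
]

# Collect readings per (year, month), mirroring groupBy("year", "month").
groups = defaultdict(list)
for day, rainfall_mm in rows:
    groups[(day.year, day.month)].append(rainfall_mm)

# Average each group and round to 2 decimals, mirroring round(avg(...), 2).
monthly_avg = {
    key: round(sum(vals) / len(vals), 2) for key, vals in sorted(groups.items())
}
print(monthly_avg)
```

Each price record later receives the single monthly value for its (year, month) pair, which is why every January 2014 row in the joined output carries the same `avg_rainfall_mm`.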



In [119]:
# Join market price data with rainfall data on year and month
food_price_data = data_clean.join(
    rainfall_clean,
    on=["year", "month"],  # Join keys
    how="left"
)

# Show sample of the joined result
food_price_data.show(5)

+----+-----+-------+--------+--------+------------------+-------------+-----+---------+------+---------------+
|year|month| region|  county|  market|          category|    commodity| unit|pricetype| price|avg_rainfall_mm|
+----+-----+-------+--------+--------+------------------+-------------+-----+---------+------+---------------+
|2014|    1|  Coast| Mombasa| Mombasa|cereals and tubers|        Maize|   KG|Wholesale| 38.44|         259.33|
|2014|    1|  Coast| Mombasa| Mombasa|   pulses and nuts|        Beans|   KG|Wholesale| 79.99|         259.33|
|2014|    1|  Coast| Mombasa| Mombasa|   pulses and nuts|  Beans (dry)|90 KG|Wholesale|5738.0|         259.33|
|2014|    1|Eastern|   Kitui|   Kitui|   pulses and nuts|  Beans (dry)|   KG|   Retail|  74.0|         259.33|
|2014|    1|Eastern|Marsabit|Marsabit|cereals and tubers|Maize (white)|   KG|   Retail| 53.36|         259.33|
+----+-----+-------+--------+--------+------------------+-------------+-----+---------+------+---------------+
only showing top 5 rows

In [120]:
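from pyspark.sql.functions import col, sum as spark_sum  # alias avoids shadowing Python's built-in sum

# Count nulls in each column of the joined DataFrame
nulls = food_price_data.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in food_price_data.columns])
nulls.show()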

+----+-----+------+------+------+--------+---------+----+---------+-----+---------------+
|year|month|region|county|market|category|commodity|unit|pricetype|price|avg_rainfall_mm|
+----+-----+------+------+------+--------+---------+----+---------+-----+---------------+
|   0|    0|     0|     0|     0|       0|        0|   0|        0|    0|              0|
+----+-----+------+------+------+--------+---------+----+---------+-----+---------------+



In [121]:
for column in food_price_data.columns:
    print(f"\nUnique values for column: {column}")
    food_price_data.select(column).distinct().show(truncate=False)


Unique values for column: year
+----+
|year|
+----+
|2018|
|2015|
|2023|
|2022|
|2014|
|2019|
|2020|
|2016|
|2024|
|2017|
|2021|
+----+


Unique values for column: month
+-----+
|month|
+-----+
|12   |
|1    |
|6    |
|3    |
|5    |
|9    |
|4    |
|8    |
|7    |
|10   |
|11   |
|2    |
+-----+


Unique values for column: region
+-------------+
|region       |
+-------------+
|Rift Valley  |
|Eastern      |
|North Eastern|
|Nyanza       |
|Coast        |
|Central      |
|Nairobi      |
+-------------+


Unique values for column: county
+-----------+
|county     |
+-----------+
|Uasin Gishu|
|Nakuru     |
|Mandera    |
|Kisumu     |
|Marsabit   |
|Wajir      |
|Kajiado    |
|Turkana    |
|Mombasa    |
|Kwale      |
|Makueni    |
|Meru South |
|Garissa    |
|Nairobi    |
|Isiolo     |
|Kitui      |
|Kilifi     |
|Baringo    |
|West Pokot |
|Nyeri      |
+-----------+
only showing top 20 rows


Unique values for column: market
+-------------------------------+
|market                  

In [123]:
from pyspark.sql.functions import udf, col, round as spark_round, log1p
from pyspark.sql.types import DoubleType

# List of units to drop
units_to_drop = ["Unit", "Bunch", "Head"]
food_price_data = food_price_data.filter(~col("unit").isin(units_to_drop))

# Define conversion factors
unit_conversion = {
    '400 G': 2.5,
    '64 KG': 1/64,
    'L': 1,
    '200 G': 5,
    '50 KG': 1/50,
    '13 KG': 1/13,
    '90 KG': 1/90,
    '200 ML': 5,
    '126 KG': 1/126,
    'KG': 1,
    '500 ML': 2
}

# Create UDF
def normalize_price(price, unit):
    factor = unit_conversion.get(unit, 1)
    return price * factor

normalize_price_udf = udf(normalize_price, DoubleType())

# Apply normalization
food_price_data = food_price_data.withColumn("normalized_price", normalize_price_udf(col("price"), col("unit")))

# Round to 2 decimal places
food_price_data = food_price_data.withColumn("normalized_price", spark_round(col("normalized_price"), 2))

# Optional: log transform to reduce skewness
food_price_data = food_price_data.withColumn("log_normalized_price", log1p(col("normalized_price")))

# Show first 5 rows
food_price_data.select("price", "unit", "normalized_price", "log_normalized_price").show(5, truncate=False)

+------+-----+----------------+--------------------+
|price |unit |normalized_price|log_normalized_price|
+------+-----+----------------+--------------------+
|38.44 |KG   |38.44           |3.6747805297344347  |
|79.99 |KG   |79.99           |4.3943256902608985  |
|5738.0|90 KG|63.76           |4.170688128809434   |
|74.0  |KG   |74.0            |4.31748811353631    |
|53.36 |KG   |53.36           |3.995628589282943   |
+------+-----+----------------+--------------------+
only showing top 5 rows
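The conversion can be checked by hand against the third displayed row, where 5738.0 per 90 KG bag becomes 63.76 per KG. The sketch below reproduces that arithmetic in plain Python, without Spark. It also flags units that are missing from the mapping, because `unit_conversion.get(unit, 1)` silently leaves such prices unscaled. The helper `unmapped_units` and the sample unit `"5 L"` are illustrative assumptions, not part of the notebook.

```python
import math

# Conversion factors copied from the notebook cell above.
unit_conversion = {
    "400 G": 2.5, "64 KG": 1 / 64, "L": 1, "200 G": 5, "50 KG": 1 / 50,
    "13 KG": 1 / 13, "90 KG": 1 / 90, "200 ML": 5, "126 KG": 1 / 126,
    "KG": 1, "500 ML": 2,
}

def normalize_price(price, unit):
    """Scale a price to a per-KG (or per-L) basis; unknown units keep factor 1."""
    return price * unit_conversion.get(unit, 1)

def unmapped_units(units):
    """Hypothetical helper: list the units that would fall through to factor 1."""
    return sorted(set(units) - set(unit_conversion))

# Third row of the output above: 5738.0 per 90 KG bag.
per_kg = round(normalize_price(5738.0, "90 KG"), 2)
log_per_kg = math.log1p(per_kg)
print(per_kg, log_per_kg)

# Illustrative input only: "5 L" is a made-up unit used to show the check.
print(unmapped_units(["KG", "90 KG", "5 L"]))
```

Running a check like `unmapped_units` on the distinct values of the `unit` column before normalizing would show whether any prices are passing through unconverted.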



In [124]:
food_price_data.show(4)

+----+-----+-------+-------+-------+------------------+-----------+-----+---------+------+---------------+----------------+--------------------+
|year|month| region| county| market|          category|  commodity| unit|pricetype| price|avg_rainfall_mm|normalized_price|log_normalized_price|
+----+-----+-------+-------+-------+------------------+-----------+-----+---------+------+---------------+----------------+--------------------+
|2014|    1|  Coast|Mombasa|Mombasa|cereals and tubers|      Maize|   KG|Wholesale| 38.44|         259.33|           38.44|  3.6747805297344347|
|2014|    1|  Coast|Mombasa|Mombasa|   pulses and nuts|      Beans|   KG|Wholesale| 79.99|         259.33|           79.99|  4.3943256902608985|
|2014|    1|  Coast|Mombasa|Mombasa|   pulses and nuts|Beans (dry)|90 KG|Wholesale|5738.0|         259.33|           63.76|   4.170688128809434|
|2014|    1|Eastern|  Kitui|  Kitui|   pulses and nuts|Beans (dry)|   KG|   Retail|  74.0|         259.33|            74.0|    4.3

In [125]:
output_path = "drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets/data/FoodPriceData"

food_price_data.write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv(output_path)

# MODELLING

In [135]:
path = "drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets/data/FoodPriceData"
food_price_data = spark.read.csv(path, inferSchema=True, header=True)

In [136]:
food_price_data.show(5)

+----+-----+-------+--------+--------+------------------+-------------+-----+---------+------+---------------+----------------+--------------------+
|year|month| region|  county|  market|          category|    commodity| unit|pricetype| price|avg_rainfall_mm|normalized_price|log_normalized_price|
+----+-----+-------+--------+--------+------------------+-------------+-----+---------+------+---------------+----------------+--------------------+
|2014|    1|  Coast| Mombasa| Mombasa|cereals and tubers|        Maize|   KG|Wholesale| 38.44|         259.33|           38.44|  3.6747805297344347|
|2014|    1|  Coast| Mombasa| Mombasa|   pulses and nuts|        Beans|   KG|Wholesale| 79.99|         259.33|           79.99|  4.3943256902608985|
|2014|    1|  Coast| Mombasa| Mombasa|   pulses and nuts|  Beans (dry)|90 KG|Wholesale|5738.0|         259.33|           63.76|   4.170688128809434|
|2014|    1|Eastern|   Kitui|   Kitui|   pulses and nuts|  Beans (dry)|   KG|   Retail|  74.0|         259

In [137]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml import PipelineModel

In [138]:
# Drop nulls from essential columns
model_data = food_price_data.dropna(subset=[
    "region", "county", "market", "category", "commodity", "unit", "pricetype",
    "log_normalized_price", "avg_rainfall_mm"
])


In [139]:
# Index categorical columns
categorical_cols = ["region", "county", "market", "category", "commodity", "unit", "pricetype"]
indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed", handleInvalid="keep") for c in categorical_cols]

In [140]:
# Assemble features
feature_cols = [c + "_indexed" for c in categorical_cols] + ["month", "year", "avg_rainfall_mm"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

In [141]:
# Define model
dt = DecisionTreeRegressor(featuresCol="features", labelCol="log_normalized_price")

In [142]:
# Build pipeline
pipeline = Pipeline(stages=indexers + [assembler, dt])

In [143]:
# Split the data
train_data, test_data = model_data.randomSplit([0.8, 0.2], seed=42)

In [144]:
# Set up evaluator
evaluator = RegressionEvaluator(
    labelCol="log_normalized_price",
    predictionCol="prediction",
    metricName="rmse"
)
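Because the label is `log_normalized_price`, the RMSE reported by this evaluator is expressed in log1p units, not in price per kilogram. A minimal sketch, in plain Python, of how a value on the log scale could be mapped back to the price scale with `expm1`, the inverse of `log1p`. The function name `to_price` is hypothetical, and the sample value is the normalized price 63.76 shown earlier.

```python
import math

def to_price(log_value):
    """Invert log1p: map a value on the log scale back to the price scale."""
    return math.expm1(log_value)

# Round trip on a normalized price taken from the earlier output.
log_value = math.log1p(63.76)
recovered = to_price(log_value)
print(round(recovered, 2))  # 63.76
```

The same inversion could be applied to the `prediction` column before computing an error in price units, if that is the quantity of interest.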

In [145]:
# Cross-validation with param grid
paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [3, 5, 10]) \
    .addGrid(dt.maxBins, [32, 64]) \
    .build()

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=3
)

In [146]:
# Train model with CV
cv_model = cv.fit(train_data)

IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 2 has 62 values. Consider removing this and other categorical features with a large number of values, or add more training examples.
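The exception states that the largest categorical feature has 62 values while the grid includes `maxBins=32`. One possible fix, sketched below on the assumption that every grid value only needs to be at least as large as that count, is to filter the candidate `maxBins` values before building the grid. The helper `valid_max_bins` and the candidate list are illustrative, not part of the notebook.

```python
def valid_max_bins(candidates, max_categories):
    """Keep only maxBins candidates that can hold the largest categorical feature."""
    return [bins for bins in candidates if bins >= max_categories]

# 62 is the category count reported in the error message above.
max_categories = 62
grid_values = valid_max_bins([32, 64, 128], max_categories)
print(grid_values)  # [64, 128]

# In the notebook, the grid line could then read (assumption, not run here):
#   .addGrid(dt.maxBins, grid_values)
```

Dropping or grouping high-cardinality columns such as `market`, as the error message itself suggests, is the other route if larger bin counts make training too slow.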

# Git Version Control Setup

# brc0d3s (dev Branch)

In [None]:
%cd /content/drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets

/content/drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets


In [None]:
!git pull origin main

remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100

In [None]:
!git add .

In [None]:
!git config --global user.email "brc0d3s@gmail.com"
!git config --global user.name "brc0d3s"

In [None]:
!git commit -m "model"

[dev a13a79d] model
 88 files changed, 10 insertions(+), 10 deletions(-)
 rewrite Food_Price_Prediction.ipynb (90%)
 rename data/cleaned_data.csv/{.part-00000-fab14f22-c9e1-445c-aafe-a3d17882e985-c000.csv.crc => .part-00000-c27eb531-d0d4-443f-9895-0c7e946efb22-c000.csv.crc} (100%)
 rename data/cleaned_data.csv/{part-00000-fab14f22-c9e1-445c-aafe-a3d17882e985-c000.csv => part-00000-c27eb531-d0d4-443f-9895-0c7e946efb22-c000.csv} (100%)
 rewrite models/gbt_price_prediction_model/metadata/part-00000 (100%)
 rename models/gbt_price_prediction_model/stages/{0_StringIndexer_458939cc39e8 => 0_StringIndexer_25159a3be5af}/data/._SUCCESS.crc (100%)
 create mode 100644 models/gbt_price_prediction_model/stages/0_StringIndexer_25159a3be5af/data/.part-00000-4908dc4d-bb59-44e7-a042-ade909dd9070-c000.snappy.parquet.crc
 rename models/gbt_price_prediction_model/stages/{0_StringIndexer_458939cc39e8 => 0_StringIndexer_25159a3be5af}/data/_SUCCESS (100%)
 rename models/gbt_price_prediction_model/stages/{0_S

In [None]:
!git push origin dev

remote: Invalid username or password.
fatal: Authentication failed for 'https://github.com/brc0d3s/Distributed-Food-Price-Prediction-for-Kenyan-Markets.git/'
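GitHub does not accept account passwords for Git operations over HTTPS, so a push from Colab generally needs a personal access token (or an SSH key). A hedged sketch of one way to build a token-bearing remote URL in Python follows. The environment variable name `GITHUB_TOKEN` and the helper `authenticated_remote` are assumptions for illustration, and whether a missing or expired token is the actual cause of the failure above is not confirmed by the output.

```python
import os

def authenticated_remote(user, token, repo_path):
    """Build an HTTPS remote URL that carries a personal access token."""
    return f"https://{user}:{token}@github.com/{repo_path}.git"

# GITHUB_TOKEN is an assumed variable name; "<token>" is a placeholder fallback.
token = os.environ.get("GITHUB_TOKEN") or "<token>"
url = authenticated_remote(
    "brc0d3s",
    token,
    "brc0d3s/Distributed-Food-Price-Prediction-for-Kenyan-Markets",
)

# In a notebook cell the URL could then be passed to git, for example:
#   !git push {url} dev
print(url.replace(token, "***"))  # mask the secret before printing
```

Reading the token from the environment, not typing it into a cell, keeps it out of the saved notebook and its outputs.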


# barth123 (barth Branch)

In [None]:
%cd /content/drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets

/content/drive/MyDrive/Distributed-Food-Price-Prediction-for-Kenyan-Markets


In [None]:
!git pull

From https://github.com/brc0d3s/Distributed-Food-Price-Prediction-for-Kenyan-Markets
 * [new branch]      barth      -> origin/barth
Already up to date.


In [None]:
!git add .

In [None]:
!git commit -m "Data Cleaning"

On branch dev
Your branch is ahead of 'origin/dev' by 5 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean


In [None]:
!git push origin dev

remote: Invalid username or password.
fatal: Authentication failed for 'https://github.com/brc0d3s/Distributed-Food-Price-Prediction-for-Kenyan-Markets.git/'
