## Pre-processing raw data of Nigeria food prices

Omdena-Kano Central, Nigeria Chapter (https://omdena.com/local-chapters/kano-nigeria-chapter/)

- <b>Project</b>: https://omdena.com/chapter-challenges/analysis-and-prediction-of-food-prices-in-nigeria-using-machine-learning-and-python/

- <b>Author</b>: Abhishek Dutta

<i> Credits to Anupama Mathpal for the raw dataset </i>

In [1]:
# Run the below command if you're running the code for the first time and don't have the polars library installed.
%pip install -q polars

Note: you may need to restart the kernel to use updated packages.


In [2]:
import sys
import polars as pl
from datetime import datetime

In [3]:
print(f'Python version used: {sys.version}')
print(f'Polars version used: {pl.__version__}')

Python version used: 3.9.17 (main, Jun 12 2023, 23:26:41) 
[Clang 14.0.3 (clang-1403.0.22.14.1)]
Polars version used: 0.19.8


### Reading the raw data file into a Polars dataframe

In [4]:
start = datetime.now()
data = pl.read_csv('wfp_food_prices_nga.csv', has_header=True)
print(f'Time taken for reading CSV file: {(datetime.now() - start).total_seconds()} seconds')
# Polars is often faster than pandas for I/O (input/output) operations such as reading CSV files; see the comparison linked below.
# https://medium.com/cuenex/pandas-2-0-vs-polars-the-ultimate-battle-a378eb75d6d1

Time taken for reading CSV file: 0.055164 seconds
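
For short operations like this one, a monotonic clock such as `time.perf_counter()` is usually a better fit than `datetime.now()`, which follows the wall clock. A minimal sketch, where the `timed` helper is a hypothetical name and `sum` stands in for the CSV read:

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this would be the pl.read_csv call.
result, seconds = timed(sum, range(1_000_000))
print(f"Time taken: {seconds:.6f} seconds")
```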


In [5]:
print(f'Shape of the dataset is: {data.shape[0]} rows and {data.shape[1]} columns')

Shape of the dataset is: 80982 rows and 14 columns


In [6]:
data.head(5)

date,admin1,admin2,market,latitude,longitude,category,commodity,unit,priceflag,pricetype,currency,price,usdprice
str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""#date""","""#adm1+name""","""#adm2+name""","""#loc+market+na…","""#geo+lat""","""#geo+lon""","""#item+type""","""#item+name""","""#item+unit""","""#item+price+fl…","""#item+price+ty…","""#currency""","""#value""","""#value+usd"""
"""2002-01-15""","""Katsina""","""Jibia""","""Jibia (CBM)""","""13.08""","""7.24""","""cereals and tu…","""Maize""","""KG""","""actual""","""Wholesale""","""NGN""","""175.92""","""1.5525"""
"""2002-01-15""","""Katsina""","""Jibia""","""Jibia (CBM)""","""13.08""","""7.24""","""cereals and tu…","""Millet""","""KG""","""actual""","""Wholesale""","""NGN""","""150.18""","""1.3254"""
"""2002-01-15""","""Katsina""","""Jibia""","""Jibia (CBM)""","""13.08""","""7.24""","""cereals and tu…","""Rice (imported…","""KG""","""actual""","""Wholesale""","""NGN""","""358.7""","""3.1656"""
"""2002-01-15""","""Katsina""","""Jibia""","""Jibia (CBM)""","""13.08""","""7.24""","""cereals and tu…","""Sorghum""","""KG""","""actual""","""Wholesale""","""NGN""","""155.61""","""1.3733"""


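The first record is not data: it holds hashtag-style tags (such as `#date` and `#adm1+name`) that describe each column. Its presence is also why every column, including the numeric ones, was read as `str`. We therefore skip that record and keep the data following it.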

In [7]:
data = data[1:,:]
# Polars has no concept of an index: https://towardsdatascience.com/understand-polars-lack-of-indexes-526ea75e413
# data[1:, :] keeps every row from position 1 to the end (dropping row 0); the ':' after the comma keeps all columns

In [8]:
print(f'Updated shape of the dataset is: {data.shape[0]} rows and {data.shape[1]} columns')

Updated shape of the dataset is: 80981 rows and 14 columns


In [9]:
print(f'Estimated memory usage of the dataframe: {round(data.estimated_size("mb"),3)} mb')

Estimated memory usage of the dataframe: 18.348 mb


### Fixing data types of individual columns

1. 'date' - Correct data type should be date
2. 'admin1', 'admin2' and 'market' - Correct data type should be categorical
3. 'latitude' and 'longitude' - Correct data type should be float
4. 'category', 'commodity', 'unit', 'priceflag', 'pricetype' and 'currency' - Correct data type should be categorical
5. 'price' and 'usdprice' - Correct data type should be float

In [10]:
data = data.select(
    pl.col('date').str.to_date("%Y-%m-%d"),
    pl.col('admin1').cast(pl.Categorical),
    pl.col('admin2').cast(pl.Categorical),
    pl.col('market').cast(pl.Categorical),
    pl.col('latitude').cast(pl.Float32),
    pl.col('longitude').cast(pl.Float32),
    pl.col('category').cast(pl.Categorical),
    pl.col('commodity').cast(pl.Categorical),
    pl.col('unit').cast(pl.Categorical),
    pl.col('priceflag').cast(pl.Categorical),
    pl.col('pricetype').cast(pl.Categorical),
    pl.col('currency').cast(pl.Categorical),
    pl.col('price').cast(pl.Float32),
    pl.col('usdprice').cast(pl.Float32)
)

In [11]:
print(f'Estimated memory usage of the dataframe after fixing data types of columns: {round(data.estimated_size("mb"),3)} mb')

Estimated memory usage of the dataframe after fixing data types of columns: 4.328 mb


### Check for duplicated records

In [12]:
data.filter(data.is_duplicated())

date,admin1,admin2,market,latitude,longitude,category,commodity,unit,priceflag,pricetype,currency,price,usdprice
date,cat,cat,cat,f32,f32,cat,cat,cat,cat,cat,cat,f32,f32


We see from the above that there are no duplicated records.

### Check for null values

In [13]:
data.null_count()

date,admin1,admin2,market,latitude,longitude,category,commodity,unit,priceflag,pricetype,currency,price,usdprice
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0


We see from the above that there are no null/missing values in any of the columns in the dataset.

### Removing records where priceflag is 'forecast' or the price value is 0

Forecast prices are not observed values and a price of 0 is unlikely to be a real observation, so records of these sorts may not be useful for machine learning.

In [14]:
print(data['priceflag'].unique())

shape: (4,)
Series: 'priceflag' [cat]
[
	"actual"
	"actual,aggregate"
	"aggregate"
	"forecast"
]


In [15]:
data = data.filter(
    ~((data['priceflag']=='forecast') | (data['price']==0))
    )

### Drop column currency and rename price column to ngnprice

The currency column seems unnecessary, since the records shown are all priced in NGN. Renaming the price column to ngnprice keeps the currency visible in the column name.

In [16]:
data = data.drop('currency')

In [17]:
data = data.rename({'price':'ngnprice'})

In [18]:
data.head(5)

date,admin1,admin2,market,latitude,longitude,category,commodity,unit,priceflag,pricetype,ngnprice,usdprice
date,cat,cat,cat,f32,f32,cat,cat,cat,cat,cat,f32,f32
2002-01-15,"""Katsina""","""Jibia""","""Jibia (CBM)""",13.08,7.24,"""cereals and tu…","""Maize""","""KG""","""actual""","""Wholesale""",175.919998,1.5525
2002-01-15,"""Katsina""","""Jibia""","""Jibia (CBM)""",13.08,7.24,"""cereals and tu…","""Millet""","""KG""","""actual""","""Wholesale""",150.179993,1.3254
2002-01-15,"""Katsina""","""Jibia""","""Jibia (CBM)""",13.08,7.24,"""cereals and tu…","""Rice (imported…","""KG""","""actual""","""Wholesale""",358.700012,3.1656
2002-01-15,"""Katsina""","""Jibia""","""Jibia (CBM)""",13.08,7.24,"""cereals and tu…","""Sorghum""","""KG""","""actual""","""Wholesale""",155.610001,1.3733
2002-01-15,"""Katsina""","""Jibia""","""Jibia (CBM)""",13.08,7.24,"""pulses and nut…","""Beans (niebe)""","""KG""","""actual""","""Wholesale""",196.869995,1.7374


### Exporting processed dataset as both CSV and Parquet files

In [19]:
data.write_csv('wfp_food_prices_processed.csv', date_format="%d-%m-%Y")
data.write_parquet('wfp_food_prices_processed.parquet')

### Summary

Overall, the raw dataset was already in good condition, so little pre-processing was required. Some of the steps were:
1. Checked for duplicated records and null values - None were found
2. Removed records where price was 0 or priceflag value was 'forecast' - Not really useful for machine learning/EDA
3. Dropped the currency column and included the currency in the price column's name (ngnprice) - Made sense and saved some memory as well