## Exploring the data

- Source: **Food and Agriculture Organization of the United Nations** [FAOSTAT](https://www.fao.org/faostat/en/#data/QCL)
- The dataset is a time series of crop and livestock statistics

In [1]:
from pyspark.sql import SparkSession
import urllib.request
import zipfile
import os

In [2]:
spark = SparkSession.builder.appName("FAOSTAT Data").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/31 15:41:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Reading data

- First, we download the data from the source URL and extract the archive

In [3]:
file_url = "https://bulks-faostat.fao.org/production/Production_Crops_Livestock_E_All_Data.zip"
local_zip_path = "/tmp/Production_Crops_Livestock_E_All_Data.zip"
urllib.request.urlretrieve(file_url, local_zip_path)

('/tmp/Production_Crops_Livestock_E_All_Data.zip',
 <http.client.HTTPMessage at 0x7f0c4a1058d0>)

In [4]:
with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
    zip_ref.extractall("/tmp/data")
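
The download cell above fetches the archive again on every run. Below is a minimal sketch of a guard that skips the download when the archive is already on disk. `fetch_once` and its `retrieve` parameter are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import os
import urllib.request


def fetch_once(url, dest_path, retrieve=urllib.request.urlretrieve):
    """Download url to dest_path unless a non-empty file is already there.

    Returns True if a download happened, False if the existing file was reused.
    `retrieve` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False
    parent = os.path.dirname(dest_path)
    if parent:
        # Create the target directory if it does not exist yet.
        os.makedirs(parent, exist_ok=True)
    retrieve(url, dest_path)
    return True
```

With the notebook's variables this would be called as `fetch_once(file_url, local_zip_path)`. The check only looks at file size, so a partially downloaded archive would still be reused; deleting the file forces a fresh download.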

In [5]:
csv_file_path = "/tmp/data/Production_Crops_Livestock_E_All_Data.csv"
df = spark.read.csv(csv_file_path, inferSchema=True, header=True)

                                                                                

In [6]:
df.printSchema()

root
 |-- Area Code: integer (nullable = true)
 |-- Area Code (M49): string (nullable = true)
 |-- Area: string (nullable = true)
 |-- Item Code: integer (nullable = true)
 |-- Item Code (CPC): string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Element Code: integer (nullable = true)
 |-- Element: string (nullable = true)
 |-- Unit: string (nullable = true)
 |-- Y1961: double (nullable = true)
 |-- Y1961F: string (nullable = true)
 |-- Y1961N: string (nullable = true)
 |-- Y1962: double (nullable = true)
 |-- Y1962F: string (nullable = true)
 |-- Y1962N: string (nullable = true)
 |-- Y1963: double (nullable = true)
 |-- Y1963F: string (nullable = true)
 |-- Y1963N: string (nullable = true)
 |-- Y1964: double (nullable = true)
 |-- Y1964F: string (nullable = true)
 |-- Y1964N: string (nullable = true)
 |-- Y1965: double (nullable = true)
 |-- Y1965F: string (nullable = true)
 |-- Y1965N: string (nullable = true)
 |-- Y1966: double (nullable = true)
 |-- Y1966F: string (nu

In [7]:
df.head(1)

24/10/31 15:41:48 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


[Row(Area Code=2, Area Code (M49)="'004", Area='Afghanistan', Item Code=221, Item Code (CPC)="'01371", Item='Almonds, in shell', Element Code=5312, Element='Area harvested', Unit='ha', Y1961=0.0, Y1961F='A', Y1961N=None, Y1962=0.0, Y1962F='A', Y1962N=None, Y1963=0.0, Y1963F='A', Y1963N=None, Y1964=0.0, Y1964F='A', Y1964N=None, Y1965=0.0, Y1965F='A', Y1965N=None, Y1966=0.0, Y1966F='A', Y1966N=None, Y1967=0.0, Y1967F='A', Y1967N=None, Y1968=0.0, Y1968F='A', Y1968N=None, Y1969=0.0, Y1969F='A', Y1969N=None, Y1970=0.0, Y1970F='A', Y1970N=None, Y1971=0.0, Y1971F='A', Y1971N=None, Y1972=0.0, Y1972F='A', Y1972N=None, Y1973=0.0, Y1973F='A', Y1973N=None, Y1974=0.0, Y1974F='A', Y1974N=None, Y1975=0.0, Y1975F='E', Y1975N=None, Y1976=5900.0, Y1976F='E', Y1976N=None, Y1977=6000.0, Y1977F='E', Y1977N=None, Y1978=6000.0, Y1978F='E', Y1978N=None, Y1979=6000.0, Y1979F='E', Y1979N=None, Y1980=5800.0, Y1980F='E', Y1980N=None, Y1981=5800.0, Y1981F='E', Y1981N=None, Y1982=5800.0, Y1982F='E', Y1982N=None, Y1

In [13]:
df.groupBy("Item").count().show(20)

+--------------------+-----+
|                Item|count|
+--------------------+-----+
|  Butter of cow milk|  161|
|           Olive oil|   69|
|           Whey, dry|   78|
|Rapeseed or canol...|  105|
|         Canary seed|   94|
|Meat of other dom...|   21|
|                Bees|  163|
|             Turkeys|  124|
| Sheep and Goat Meat|  703|
|          Pineapples|  370|
|            Potatoes|  606|
|Roots and Tubers,...|  721|
|Edible offals of ...|  233|
|Peppermint, spear...|   73|
|   Other pome fruits|   92|
|Raw hides and ski...|  672|
|Edible offal of b...|  115|
|        Green garlic|  427|
|       Coffee, green|  329|
|Onions and shallo...|  303|
+--------------------+-----+
only showing top 20 rows



In [9]:
df.filter("Item=='Almonds, in shell'").select('Area').show(df.count(), False)

+---------------------------------------+
|Area                                   |
+---------------------------------------+
|Afghanistan                            |
|Afghanistan                            |
|Afghanistan                            |
|Algeria                                |
|Algeria                                |
|Algeria                                |
|Argentina                              |
|Argentina                              |
|Argentina                              |
|Armenia                                |
|Armenia                                |
|Australia                              |
|Australia                              |
|Australia                              |
|Azerbaijan                             |
|Azerbaijan                             |
|Azerbaijan                             |
|Belgium                                |
|Belgium                                |
|Bosnia and Herzegovina                 |
|Bosnia and Herzegovina           

## Element and Item codes and their meaning

- As the data shows, each row has an **Element** and an **Item** column. Let's look in more detail at what these columns mean

In [10]:
elements = spark.read.csv("data/elements.csv", inferSchema=True, header=True)

In [11]:
elements.printSchema()

root
 |-- Domain Code: string (nullable = true)
 |-- Domain: string (nullable = true)
 |-- Element Code: string (nullable = true)
 |-- Element: string (nullable = true)
 |-- Unit: string (nullable = true)



In [12]:
elements.head(5)

[Row(Domain Code='GCE', Domain='Climate Change: Agrifood systems emissions: Emissions from Crops', Element Code='5312', Element='Area harvested', Unit='ha'),
 Row(Domain Code='GCE', Domain='Climate Change: Agrifood systems emissions: Emissions from Crops', Element Code='7245', Element='Burning crop residues (Biomass burned, dry matter)', Unit='t'),
 Row(Domain Code='GCE', Domain='Climate Change: Agrifood systems emissions: Emissions from Crops', Element Code='72257', Element='Burning crop residues (Emissions CH4)', Unit='kt'),
 Row(Domain Code='GCE', Domain='Climate Change: Agrifood systems emissions: Emissions from Crops', Element Code='72307', Element='Burning crop residues (Emissions N2O)', Unit='kt'),
 Row(Domain Code='GCE', Domain='Climate Change: Agrifood systems emissions: Emissions from Crops', Element Code='72342', Element='Crop residues (Direct emissions N2O)', Unit='kt')]

In [14]:
items = spark.read.csv("data/item.csv", inferSchema=True, header=True)

In [15]:
items.printSchema()

root
 |-- Domain Code: string (nullable = true)
 |-- Domain: string (nullable = true)
 |-- Item Code: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- CPC Code: string (nullable = true)
 |-- HS Code: string (nullable = true)
 |-- HS07 Code: string (nullable = true)
 |-- HS12 Code: string (nullable = true)



In [16]:
items.head(5)

[Row(Domain Code='GCE', Domain='Climate Change: Agrifood systems emissions: Emissions from Crops', Item Code='1712', Item='All Crops', Description=None, CPC Code='F1712', HS Code=None, HS07 Code=None, HS12 Code=None),
 Row(Domain Code='GCE', Domain='Climate Change: Agrifood systems emissions: Emissions from Crops', Item Code='44', Item='Barley', Description='Barley', CPC Code='0115', HS Code=None, HS07 Code='100300', HS12 Code='100310, 100390'),
 Row(Domain Code='GCE', Domain='Climate Change: Agrifood systems emissions: Emissions from Crops', Item Code='176', Item='Beans, dry', Description='Beans, dry This subclass includes: -  beans, species of Phaseolus (vulgaris, lunatus, angularis, aureus, etc.) -  beans, species of Vigna (angularis, mungo, radiata, unguiculata, etc.) This subclass does not include: - soya beans, cf. 0141 - green beans, cf. 01241 - lentils, green, cf. 01249 -  bean shoots and sprouts, cf. 01290 - locust beans (carobs), cf. 01356 - castor beans, cf. 01449 -  broad b

## Year column

- As we can see, there is no single **Year** column. Each year is spread over three columns, such as Y2022, Y2022F, and Y2022N
- The column without a suffix (Y2022) holds the value for that year
- The **F** suffix stands for Flag
- The **N** suffix stands for Notes: additional remarks for that year, indicating whether there is something unusual about the data
- In the first row of our data, **Y2004N** is `Unofficial figure`, meaning the value was not reported by official data sources; it may have come from alternative sources or been estimated
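
The naming convention described above can be applied mechanically to the column list. The following is a small sketch in plain Python that groups column names into value, flag, and note columns keyed by year. `split_year_columns` is a hypothetical helper, and the pattern assumes every year column is `Y` followed by four digits and an optional `F` or `N`, as in the schema printed earlier.

```python
import re

# Y + four digits + optional F (flag) or N (note) suffix.
YEAR_COL = re.compile(r"^Y(\d{4})([FN]?)$")


def split_year_columns(columns):
    """Group column names into value, flag and note columns, keyed by year."""
    groups = {"value": {}, "flag": {}, "note": {}}
    kind_by_suffix = {"": "value", "F": "flag", "N": "note"}
    for name in columns:
        match = YEAR_COL.match(name)
        if match:
            year, suffix = match.groups()
            groups[kind_by_suffix[suffix]][int(year)] = name
        # Columns such as "Area" or "Item" do not match and are skipped.
    return groups
```

Called as `split_year_columns(df.columns)`, the `"value"` entry gives the names of the numeric year columns, which can then be passed to `df.select(...)`.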

In [18]:
flags = spark.read.csv("data/flags.csv", inferSchema=True, header=True)

In [19]:
flags.printSchema()

root
 |-- Flag: string (nullable = true)
 |-- Flags: string (nullable = true)



In [20]:
flags.head(5)

[Row(Flag='A', Flags='Official figure'),
 Row(Flag='B', Flags='Time series break'),
 Row(Flag='E', Flags='Estimated value'),
 Row(Flag='F', Flags='Forecast value'),
 Row(Flag='I', Flags='Imputed value')]
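
Since the flags table is small, one way to decode the `F` columns is to turn it into a dictionary on the driver. This is a sketch under that assumption; `build_flag_lookup` and `describe_flag` are hypothetical helper names, and a Spark join against `flags` would be the alternative for decoding whole columns.

```python
def build_flag_lookup(rows):
    """Map each flag code to its description.

    `rows` is any iterable of mapping-like records with "Flag" and "Flags"
    keys, such as the Row objects returned by flags.collect().
    """
    return {row["Flag"]: row["Flags"] for row in rows}


def describe_flag(code, lookup):
    """Return the description for a flag code, or None if the code is missing or unknown."""
    return lookup.get(code)
```

In the notebook this could be used as `lookup = build_flag_lookup(flags.collect())` followed by `describe_flag(df.head(1)[0]["Y1961F"], lookup)`.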

In [25]:
df.head(1)[0]["Y2004N"]

'Unofficial figure'
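
Because each year occupies three columns, a time series analysis is usually easier on a long table with one row per year. The sketch below builds a Spark SQL `stack(...)` expression from the column names. `build_stack_expr` and the output column names `Year`, `Value`, `Flag`, and `Note` are choices made for this illustration. It assumes all value columns share one type (the schema above shows `double`) and all flag and note columns are strings; otherwise Spark may reject the expression.

```python
import re


def build_stack_expr(columns):
    """Build a stack() expression that turns Y<year>, Y<year>F, Y<year>N
    column triples into (Year, Value, Flag, Note) rows."""
    present = set(columns)
    years = sorted(
        int(m.group(1))
        for m in (re.match(r"^Y(\d{4})$", c) for c in columns)
        if m
    )
    # Keep only years for which all three columns exist.
    complete = [y for y in years if f"Y{y}F" in present and f"Y{y}N" in present]
    if not complete:
        raise ValueError("no complete Y<year>/F/N column triples found")
    parts = [f"{y}, `Y{y}`, `Y{y}F`, `Y{y}N`" for y in complete]
    return f"stack({len(complete)}, {', '.join(parts)}) as (Year, Value, Flag, Note)"
```

With the notebook's DataFrame, the expression could be passed to `df.selectExpr("Area", "Item", "Element", "Unit", build_stack_expr(df.columns))` to obtain the long table.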