# Swiss Food Analysis

Spark - Group Assignment

Group E - **Alain Grullón, Emily Yorke, Julius von Selchow, William Kingwill, Sydne-Aline Strasser, Tarek ElNoury**

Data - https://campus.ie.edu/webapps/blackboard/content/listContent.jsp?course_id=_114341265_1&content_id=_2536547_1&mode=reset

## Agenda

**I. PySpark environment setup**

**II. Data source set-up**
  
**III. Exploratory Data Analysis (EDA)**
  
**IV. Analysis**

1. Oldest Product
2. Newest Product
3. Average Product Age
4. List of Countries
5. Product Category
6. Traces
7. Data Quality Analysis
8. Data Profiling

  
**V. Product Health Metrics**


**------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

In [12]:
from IPython.display import display, HTML

display(HTML("<style>.container { width:90% !important; }</style>")) # Increase cell width
display(HTML("<style>.rendered_html { font-size: 14px; }</style>")) # Increase font size

## I. PySpark Environment Setup

In [13]:
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

sc = SparkContext.getOrCreate()  # reuse the running context if there is one
spark = SparkSession(sc)         # SQL entry point bound to that context

## II. Data Source Read and DataFrame Setup

In [14]:
FoodDF = spark.read \
              .format("csv") \
              .option("header","true") \
              .option("inferSchema","true") \
              .option("path","en.openfoodfacts.org.products.Switzerland.csv.gz") \
              .load()

## III. Exploratory Data Analysis (EDA)

In [15]:
from IPython.display import display, Markdown

display(Markdown("This DataFrame has **%d rows**." % FoodDF.count()))
FoodDF.printSchema()

This DataFrame has **50963 rows**.

root
 |-- code: double (nullable = true)
 |-- url: string (nullable = true)
 |-- creator: string (nullable = true)
 |-- created_t: integer (nullable = true)
 |-- created_datetime: timestamp (nullable = true)
 |-- last_modified_t: integer (nullable = true)
 |-- last_modified_datetime: timestamp (nullable = true)
 |-- product_name: string (nullable = true)
 |-- generic_name: string (nullable = true)
 |-- quantity: string (nullable = true)
 |-- packaging: string (nullable = true)
 |-- packaging_tags: string (nullable = true)
 |-- brands: string (nullable = true)
 |-- brands_tags: string (nullable = true)
 |-- categories: string (nullable = true)
 |-- categories_tags: string (nullable = true)
 |-- categories_en: string (nullable = true)
 |-- origins: string (nullable = true)
 |-- origins_tags: string (nullable = true)
 |-- manufacturing_places: string (nullable = true)
 |-- manufacturing_places_tags: string (nullable = true)
 |-- labels: string (nullable = true)
 |-- labels_tags: string (n

In [16]:
FoodDF_wo_Date = FoodDF.drop("created_datetime","last_modified_datetime")

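We drop the two datetime columns first because `isnan` is defined only for numeric input and cannot be applied to timestamp columns. They can still be checked for nulls with `isNull()` alone.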

In [17]:
from pyspark.sql.functions import isnull, isnan, when, count, col
FoodDF_wo_Date.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in FoodDF_wo_Date.columns]).show()

+----+---+-------+---------+---------------+------------+------------+--------+---------+--------------+------+-----------+----------+---------------+-------------+-------+------------+--------------------+-------------------------+------+-----------+---------+---------+--------------+------------------------+------+-----------+---------------+------+---------+--------------+------------+----------------+---------+------------+------+-----------+---------+------------+----------------+-------------+-----------+---------+--------------+------------+---------------------------+-------------------------+------------------------------+---------------------------------------+-------------------------------------+------------------------------------------+----------------+----------------+----------+-------------+-------------+------+-----------+---------+-----------+-------------+----------------+---------+---------------+---------------------+---------------------------+-------------------

The dataset contains a large number of nulls and would need substantial cleaning before any modeling. Since this project focuses on data analysis rather than data engineering or machine learning, we extract insights from the values that are available.

To count the nulls in a single column, replace the column name (`'traces_en'`) in the cell below.

In [18]:
FoodDF.select([count(when(col('traces_en').isNull(), True)).alias("traces_en nulls")]).show()

+---------------+
|traces_en nulls|
+---------------+
|          46270|
+---------------+
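To put a null count in context, it helps to express it as a share of all rows. The snippet below is a small arithmetic sketch in plain Python (no Spark needed); `null_share` is a hypothetical helper, and the two numbers are the `traces_en` null count and the row count printed earlier in this notebook.

```python
def null_share(null_count: int, total_rows: int) -> float:
    """Return the fraction of rows that are null for one column."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return null_count / total_rows


TOTAL_ROWS = 50963                              # row count printed in the EDA section
traces_share = null_share(46270, TOTAL_ROWS)    # nulls counted in traces_en

print(f"traces_en is null in {traces_share:.1%} of rows")
```

Roughly nine out of ten products carry no trace information, which limits how far any conclusion about traces can be generalized.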



In [19]:
FoodDF.cache()  # mark the DataFrame for caching so later actions reuse it instead of re-reading the file
FoodDF.sample(False, 0.1).take(1)

[Row(code=11210009578.0, url='http://world-en.openfoodfacts.org/product/0011210009578/tabasco-mild', creator='walterppk', created_t=1552218617, created_datetime=datetime.datetime(2019, 3, 10, 12, 50, 17), last_modified_t=1556136868, last_modified_datetime=datetime.datetime(2019, 4, 24, 22, 14, 28), product_name='Tabasco Mild', generic_name=None, quantity='60 ml', packaging='bouteille en verre', packaging_tags='bouteille-en-verre', brands='Tabasco', brands_tags='tabasco', categories='Epicerie, Sauces, Sauces pimentées', categories_tags='en:groceries,en:sauces,en:pimented-sauces', categories_en='Groceries,Sauces,Pimented sauces', origins=None, origins_tags=None, manufacturing_places=None, manufacturing_places_tags=None, labels=None, labels_tags=None, labels_en=None, emb_codes=None, emb_codes_tags=None, first_packaging_code_geo=None, cities=None, cities_tags=None, purchase_places=None, stores=None, countries='en:CH', countries_tags='en:switzerland', countries_en='Switzerland', ingredients
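The sampled row hints at a side effect of `inferSchema`: the barcode is stored as `code=11210009578.0`, while the product URL contains `0011210009578`. Parsing the code as a double drops the leading zeros. The plain-Python sketch below reproduces the effect; the repair step assumes a 13-digit code, which is an assumption for illustration and not something the dataset guarantees.

```python
raw_code = "0011210009578"      # barcode as it appears in the product URL

as_double = float(raw_code)     # what a numeric column keeps
print(as_double)                # the leading zeros are gone

# Hypothetical repair, assuming every code should have 13 digits:
restored = str(int(as_double)).zfill(13)
print(restored)
```

One way to avoid the problem at read time would be to set `inferSchema` to `"false"` so that every column is read as a string, and then cast the numeric columns explicitly.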

In [20]:
FoodDF_sub1 = FoodDF.select("product_name","created_datetime","quantity","brands","categories",
                            "ingredients_text").where((col("product_name").isNotNull()) & 
                                                                      (col("quantity").isNotNull()) & 
                                                                      (col("brands").isNotNull()) &
                                                                      (col("categories").isNotNull()) & 
                                                                      (col("ingredients_text").isNotNull()))
FoodDF_sub1.show(10)

+--------------------+-------------------+---------+--------------------+--------------------+--------------------+
|        product_name|   created_datetime| quantity|              brands|          categories|    ingredients_text|
+--------------------+-------------------+---------+--------------------+--------------------+--------------------+
|            smelties|2019-01-24 20:06:50|      50g|             bonfood|  Aliments pour bébé|Semoule de mais b...|
|Enjoy Life Chewy ...|2018-11-07 22:16:26|    473ml|           herbalife|Groceries,Snacks,...|Tapioca Syrup, Ve...|
|      Dark chocolate|2018-04-07 14:43:13|   3.5 oz|               Lindt|Snacks, Sweet sna...|chocolate, cocoa ...|
|Ferrero, nutella,...|2013-07-02 16:14:24|     1 kg|     Ferrero,Nutella|Plant-based foods...|Sugar, palm oil, ...|
|Original pepper s...|2015-01-31 05:43:58|    59 mL|Tabasco,Mc. Ilhen...|Groceries, Sauces...|Distilled vinegar...|
|Tabasco Habanero ...|2016-06-17 05:20:10|    60 ml|             Tabasco

In [21]:
FoodDF_sub2 = FoodDF.select("product_name","countries_en","purchase_places","states",
                            "origins","manufacturing_places").where((col("product_name").isNotNull()) & 
                                                                      (col("quantity").isNotNull()) & 
                                                                      (col("cities_tags").isNotNull()) &
                                                                      (col("purchase_places").isNotNull()) & 
                                                                      (col("origins").isNotNull()) & 
                                                                      (col("manufacturing_places").isNotNull()))
FoodDF_sub2.show(10)

+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|        product_name|        countries_en|     purchase_places|              states|          origins|manufacturing_places|
+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Le Jambon Supérie...|France,Réunion,Sw...|Angers,France,Cha...|en:to-be-checked,...| Union Européenne|LAMPAULAISE DES S...|
| Leche semidesnatada|Belgium,France,Ge...|              France|en:to-be-complete...|           France|              France|
|Lait facile à dig...|  France,Switzerland|       Carvin,France|en:to-be-checked,...|           France|              France|
|      Le Pâté Hénaff|France,Japan,Swit...|France,Villers Bo...|en:to-be-checked,...|  Porc : Bretagne|Jean Hénaff Produ...|
|             Salakis|Austria,Belgium,F...|Toulouse,France,D...|en:to-be-checked,...|       Frankreich|Société Fromagère...|


In [22]:
FoodDF_sub3 = FoodDF.select("product_name","nutriscore_score","nutriscore_grade","labels","saturated-fat_100g",
                            "proteins_100g","energy_100g").where((col("product_name").isNotNull()) & 
                                                                      (col("nutriscore_score").isNotNull()) & 
                                                                      (col("nutriscore_grade").isNotNull()) &
                                                                      (col("labels").isNotNull()) & 
                                                                      (col("saturated-fat_100g").isNotNull()) & 
                                                                      (col("proteins_100g").isNotNull()) & 
                                                                      (col("energy_100g").isNotNull()))
FoodDF_sub3.show(10)

+--------------------+----------------+----------------+--------------------+------------------+-------------+-----------+
|        product_name|nutriscore_score|nutriscore_grade|              labels|saturated-fat_100g|proteins_100g|energy_100g|
+--------------------+----------------+----------------+--------------------+------------------+-------------+-----------+
|Ferrero, nutella,...|              18|               d|  Fabriqué au Canada|             10.81|         5.41|     2264.0|
|Original pepper s...|               6|               c|         Gluten-free|               0.2|          0.0|       67.0|
|Tabasco Habanero ...|              13|               d|          Glutenvrij|               0.1|          1.5|      418.0|
|Chocolate Chunk D...|              21|               e|Kosher,Orthodox U...|              13.9|          3.8|     2131.0|
|Chocolate Chunk M...|              22|               e|Kascher, Orthodox...|              15.0|          5.6|     2197.0|
|Strawberry Sens

This last subset will be useful in the second part, which assesses how healthy the different products are.

Some columns, such as `cities`, contain only nulls.

## IV. Analysis

### 1. Oldest Product

In [23]:
Oldest_Food_Age_DF = FoodDF.select("product_name","created_datetime","ingredients_text", "main_category_en"
                ).orderBy(col("created_datetime").asc())
Oldest_Food_Age_DF.show(1)

+--------------------+-------------------+--------------------+--------------------+
|        product_name|   created_datetime|    ingredients_text|    main_category_en|
+--------------------+-------------------+--------------------+--------------------+
|Lulu La Barquette...|2012-02-11 16:07:23|Sirop de glucose-...|fr:Barquettes à l...|
+--------------------+-------------------+--------------------+--------------------+
only showing top 1 row



The oldest product is Lulu La Barquette (biscuits), created on February 11, 2012 at 16:07.

### 2. Newest Product

In [24]:
Newest_Food_Age_DF = FoodDF.select("product_name","created_datetime","labels","categories"
                ).sort(col("created_datetime").desc())
Newest_Food_Age_DF.show(1)

+------------+-------------------+------+----------+
|product_name|   created_datetime|labels|categories|
+------------+-------------------+------+----------+
| Pouletbrust|2020-09-12 23:53:23|  null|      null|
+------------+-------------------+------+----------+
only showing top 1 row



The newest product is Pouletbrust (chicken breast), created on September 12, 2020 at 23:53.

### 3. Average Product Age

In [25]:
from pyspark.sql.functions import current_timestamp, current_date, datediff
from pyspark.sql.functions import avg, mean

age = Newest_Food_Age_DF.withColumn("Age (Days)", datediff(current_timestamp(),col("created_datetime")))
age.show(5)

+--------------------+-------------------+------+----------+----------+
|        product_name|   created_datetime|labels|categories|Age (Days)|
+--------------------+-------------------+------+----------+----------+
|         Pouletbrust|2020-09-12 23:53:23|  null|      null|        33|
|                Feta|2020-09-12 22:29:57|  null|      null|        33|
|Mokku Mokku passi...|2020-09-12 22:27:45|  null|      null|        33|
|         Nasi Goreng|2020-09-12 21:32:32|  null|      null|        33|
|  Smooth Mix Mallows|2020-09-12 19:44:00|  null|      null|        33|
+--------------------+-------------------+------+----------+----------+
only showing top 5 rows



In [26]:
Oldest_Food_Age_DF.withColumn("Age (Days)", datediff(current_timestamp(),col("created_datetime"))).show(5)

+--------------------+-------------------+--------------------+--------------------+----------+
|        product_name|   created_datetime|    ingredients_text|    main_category_en|Age (Days)|
+--------------------+-------------------+--------------------+--------------------+----------+
|Lulu La Barquette...|2012-02-11 16:07:23|Sirop de glucose-...|fr:Barquettes à l...|      3169|
|Eau minérale natu...|2012-02-11 21:46:21|                 Eau|Carbonated minera...|      3169|
|Milka Noisette En...|2012-02-12 09:32:47|Sucre, _Noisettes...|Chocolates with h...|      3168|
|              Isio 4|2012-03-19 15:34:36|huile de colza, O...|Mixed vegetable oils|      3132|
|Sardines de Breta...|2012-03-19 19:03:09|Sardines (65%), e...|Sardines in tomat...|      3132|
+--------------------+-------------------+--------------------+--------------------+----------+
only showing top 5 rows



In [27]:
age.select(avg("Age (Days)").alias("Average Age (Days)")).show()

+------------------+
|Average Age (Days)|
+------------------+
| 872.4594902183937|
+------------------+



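The average age of the products is about 872 days, or roughly 2.4 years. Because the age is measured against the current timestamp, this figure grows each day the notebook is re-run.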
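The ages above can be checked by hand. `datediff` compares calendar dates and ignores the time of day, so the same numbers fall out of plain date subtraction. In the sketch below, `age_in_days` is a hypothetical helper and the run date of October 15, 2020 is an assumption inferred from the printed ages (33 and 3169 days); it is not stated anywhere in the notebook.

```python
from datetime import date, datetime


def age_in_days(created: datetime, run_date: date) -> int:
    """Whole calendar days between the creation date and the run date."""
    return (run_date - created.date()).days


RUN_DATE = date(2020, 10, 15)   # assumption: inferred from the outputs above

newest_age = age_in_days(datetime(2020, 9, 12, 23, 53, 23), RUN_DATE)
oldest_age = age_in_days(datetime(2012, 2, 11, 16, 7, 23), RUN_DATE)

average_days = 872.4594902183937        # average printed by the notebook
average_years = average_days / 365.25   # 365.25 accounts for leap years

print(newest_age, oldest_age, round(average_years, 1))
```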

### 4. List of Countries

Herbalife products appear with a long list of countries. Let's check how many Herbalife products the dataset contains.

In [28]:
FoodDF.select("brands_tags").filter(col("brands_tags").contains("herbalife")).count()

86

That is not significant (86 of 50,963 rows, about 0.17%), so we can exclude them.

In [29]:
FoodDF.select("countries_en","brands_tags").where(col("brands_tags").contains("herbalife")).show()

+--------------------+--------------------+
|        countries_en|         brands_tags|
+--------------------+--------------------+
|Canada,Germany,It...|           herbalife|
|Argentina-espanol...|           herbalife|
|Ecuador,European ...|           herbalife|
|France,Italy,Spai...|           herbalife|
|France,Italy,Spai...|           herbalife|
|Italy,Spain,Switz...|           herbalife|
|France,Italy,Arge...|           herbalife|
|Italy,Mexico,Spai...|           herbalife|
|France,Germany,It...|           herbalife|
|Italy,Spain,Unite...|           herbalife|
|Australia,Belgium...|           herbalife|
|France,Italy,Spai...|           herbalife|
|France,Italy,Spai...|           herbalife|
|Argentina-espanol...|           herbalife|
|Germany,Italy,Spa...|herbalife,eigenpr...|
|Brazil,France,Isr...|           herbalife|
|Italy,Mexico,Spai...|           herbalife|
|France,United Sta...|           herbalife|
|France,Italy,Spai...|           herbalife|
|France,Italy,Mexi...|          

In [30]:
FoodDF_noherba = FoodDF.select("countries_en","brands_tags").where(~col("brands_tags").contains("herbalife"))

In [31]:
FoodDF_noherba.select("brands_tags").filter(col("brands_tags").contains("herbalife")).count()

0
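One caveat with the exclusion filter: in Spark SQL, `contains` on a null value yields null, negating null still yields null, and `where` keeps only rows whose condition is true. Products with a null `brands_tags` are therefore dropped together with the Herbalife rows. The plain-Python sketch below mimics that three-valued logic with `None`; the helper names and the sample values are made up for illustration.

```python
from typing import Callable, List, Optional


def contains(value: Optional[str], needle: str) -> Optional[bool]:
    """SQL-style contains: a NULL input gives a NULL (None) result."""
    return None if value is None else needle in value


def sql_not(flag: Optional[bool]) -> Optional[bool]:
    """SQL-style NOT: NOT NULL is still NULL."""
    return None if flag is None else not flag


def where(rows: List[Optional[str]],
          predicate: Callable[[Optional[str]], Optional[bool]]) -> List[Optional[str]]:
    """Keep only rows whose predicate is exactly True, as a SQL WHERE does."""
    return [row for row in rows if predicate(row) is True]


brands = ["herbalife", "lindt", None, "herbalife,eigenprodukt"]  # made-up sample

kept_naive = where(brands, lambda b: sql_not(contains(b, "herbalife")))
kept_null_safe = where(
    brands, lambda b: b is None or sql_not(contains(b, "herbalife")) is True
)

print(kept_naive)       # the row without a brand is lost
print(kept_null_safe)   # the row without a brand is kept
```

In PySpark, a null-preserving version of the filter could be written as `col("brands_tags").isNull() | ~col("brands_tags").contains("herbalife")`, if keeping unbranded products is the intent.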

In [32]:
countries = FoodDF_noherba.select("countries_en")

In [33]:
from pyspark.sql.functions import split
CountriesFoodDF = countries.filter(col("countries_en")!="Switzerland").distinct()

In [34]:
import pyspark.sql.functions as f

countries_list = CountriesFoodDF.select(
        "countries_en",
        f.split("countries_en", ",").alias("countries"),
        f.posexplode(f.split("countries_en", ",")).alias("pos", "country")
    )
countries_list.show(5)

+--------------------+--------------------+---+-----------+
|        countries_en|           countries|pos|    country|
+--------------------+--------------------+---+-----------+
|  Canada,Switzerland|[Canada, Switzerl...|  0|     Canada|
|  Canada,Switzerland|[Canada, Switzerl...|  1|Switzerland|
|France,Germany,It...|[France, Germany,...|  0|     France|
|France,Germany,It...|[France, Germany,...|  1|    Germany|
|France,Germany,It...|[France, Germany,...|  2|      Italy|
+--------------------+--------------------+---+-----------+
only showing top 5 rows



The exploded list still contains entries that are not country names, so the next cell filters them out.

In [35]:
Countries_list = countries_list.select("country").where((~col("country").contains(":")) & 
                                                         (~col("country").contains("-")) &
                                                         (col("country")!="Switzerland") & 
                                                         (col("country")!="Frankreich") &
                                                         (col("country")!="European Union") & 
                                                         (col("country")!="En"))

Countries_List = Countries_list.dropDuplicates(['country'])

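The filter above removes entries that are not usable country names: language-prefixed tags (containing `:`), hyphenated variants (containing `-`), and the values `Switzerland`, `Frankreich`, `European Union`, and `En`.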
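The split, explode, and filter steps can be followed on a handful of rows without Spark. The sketch below is a plain-Python analogue of the logic, not the notebook's code; the function names and the sample strings are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple

EXCLUDED = {"Switzerland", "Frankreich", "European Union", "En"}


def explode_countries(rows: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Split each comma-separated string and yield (position, country) pairs."""
    for row in rows:
        for pos, country in enumerate(row.split(",")):
            yield pos, country


def is_country(name: str) -> bool:
    """Apply the same exclusions as the filter above."""
    return ":" not in name and "-" not in name and name not in EXCLUDED


sample_rows = [                       # made-up rows in the countries_en format
    "Canada,Switzerland",
    "France,Germany,Italy",
    "Belgique,Belgium,fr:suisse",
    "Argentina-espanol,France",
]

countries = sorted({c for _, c in explode_countries(sample_rows) if is_country(c)})
print(countries)
```

Two limits of this rule-based cleaning show up even in the small sample: `Belgique` and `Belgium` survive as separate entries, and the hyphen rule would also discard a legitimately hyphenated name such as Guinea-Bissau.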

In [36]:
countries_count = Countries_List.count()
print("There are " + str(countries_count) + " countries in this list ")

There are 97 countries in this list 


In [37]:
print("The Official List of other countries where food products are sold (apart from Switzerland)")

Countries_List.sort(col("country")).show(99)

The Official List of other countries where food products are sold (apart from Switzerland)
+--------------------+
|             country|
+--------------------+
|         Afghanistan|
|             Albania|
|             Algeria|
|             Andorra|
|           Argentina|
|               Aruba|
|           Australia|
|             Austria|
|             Belarus|
|            Belgique|
|             Belgium|
|Bosnia and Herzeg...|
|              Brazil|
|            Bulgaria|
|            Cambodia|
|            Cameroon|
|              Canada|
|               Chile|
|               China|
|            Colombia|
|          Costa Rica|
|             Croatia|
|                Cuba|
|              Cyprus|
|      Czech Republic|
|       Côte d'Ivoire|
|             Denmark|
|               Egypt|
|             Estonia|
|             Finland|
|              France|
|       French Guiana|
|    French Polynesia|
|               Gabon|
|            Galmudug|
|             Germany|
|           

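The food products sold in Switzerland are also sold in 97 other listed countries, which is over 40% of the world's countries. This count is an upper bound, because the list still contains duplicates such as `Belgique` and `Belgium` and entries such as `Galmudug` that are not sovereign states.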
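As a rough check of that share, the count can be divided by the number of countries in the world. The figure of 195 used below is an assumption (193 UN member states plus 2 observer states) and does not come from the dataset.

```python
listed_entries = 97       # count printed by the notebook
world_countries = 195     # assumption: 193 UN members plus 2 observer states

share = listed_entries / world_countries
print(f"{share:.1%} of the world's countries")
```

Since some of the 97 entries are duplicates or territories, the result should be read as a ceiling rather than an exact share.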

### 5. Product Category

Let's see how these categories look.

In [38]:
FoodCat_DF = FoodDF.select("product_name","main_category","main_category_en","categories","categories_tags","categories_en").where(
                                                                      (col("product_name").isNotNull()) & 
                                                                      (col("main_category").isNotNull()) & 
                                                                      (col("main_category_en").isNotNull()) &
                                                                      (col("categories").isNotNull()) & 
                                                                      (col("categories_tags").isNotNull()) & 
                                                                      (col("categories_en").isNotNull()))
FoodCat_DF.show(10)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|        product_name|       main_category|    main_category_en|          categories|     categories_tags|       categories_en|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|            smelties|       en:baby-foods|          Baby foods|  Aliments pour bébé|       en:baby-foods|          Baby foods|
|              Salmon|   en:salmon-fillets|      Salmon fillets|Productos del mar...|en:seafood,en:fis...|Seafood,Fishes,Fi...|
| Salade de lentilles|    en:lentil-salads|       Lentil salads|Plats préparés, S...|en:meals,en:salad...|Meals,Salads,Prep...|
|Enjoy Life Chewy ...|       en:supplement|          Supplement|Groceries,Snacks,...|en:groceries,en:s...|Groceries,Snacks,...|
|    Sauce bolognaise| en:bolognese-sauces|    Bolognese sauces|Groceries, Meat-b...|en:groceries,en:m..

First, let's check which column is the most reliable by counting the nulls in each.

In [39]:
FoodDF.select([count(when(col('main_category').isNull(), True)).alias("main_category nulls")]).show()
FoodDF.select([count(when(col('main_category_en').isNull(), True)).alias("main_category_en nulls")]).show()
FoodDF.select([count(when(col('categories').isNull(), True)).alias("categories nulls")]).show()
FoodDF.select([count(when(col('categories_tags').isNull(), True)).alias("categories_tags nulls")]).show()
FoodDF.select([count(when(col('categories_en').isNull(), True)).alias("categories_en nulls")]).show()

+-------------------+
|main_category nulls|
+-------------------+
|              32914|
+-------------------+

+----------------------+
|main_category_en nulls|
+----------------------+
|                 32914|
+----------------------+

+----------------+
|categories nulls|
+----------------+
|           32914|
+----------------+

+---------------------+
|categories_tags nulls|
+---------------------+
|                32914|
+---------------------+

+-------------------+
|categories_en nulls|
+-------------------+
|              32914|
+-------------------+



All five columns have the same number of nulls (32,914), so reliability does not separate them. We choose `main_category_en` because it gives a single main category per product, with its name in English where available.

#### Number of products by category

Let's get a list of the main categories.

In [40]:
categories = FoodDF.select("main_category_en").distinct()
categories.count()

3598
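A note on reading this number: Spark's `distinct()` treats null as a value of its own, so with 32,914 null entries in `main_category_en` the count of 3598 most likely includes one null row. The plain-Python sketch below shows the same effect with `None` in a set; the category names are a made-up sample.

```python
main_categories = ["Biscuits", None, "Yogurts", None, "Biscuits"]  # made-up sample

distinct_all = set(main_categories)                               # None counts as a value
distinct_named = {c for c in main_categories if c is not None}    # nulls filtered first

print(len(distinct_all), len(distinct_named))
```

Filtering with `isNotNull()` before calling `distinct()` would give the number of named categories directly.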

In [41]:
# Group by Main Category / count products
Categories = FoodDF.select(col("main_category_en").
                           alias("Main Category")
                          ).groupBy("Main Category"
                                   ).count().select("Main Category",
                                                    f.col("count"
                                                         ).alias("Number of Products")) 

Categories_DF = Categories.where(col("Main Category").isNotNull()) # show no nulls please

print("The Number of Products by Category") # title
Categories_DF.sort(col("Number of Products").desc()).show(20)

The Number of Products by Category
+--------------------+------------------+
|       Main Category|Number of Products|
+--------------------+------------------+
| Sweetened beverages|               356|
|      Swiss Gruyères|               212|
|            Biscuits|               178|
|             Yogurts|               178|
|Unsweetened bever...|               170|
|          Fruit-jams|               164|
|           Beverages|               140|
|             Bonbons|               121|
|     Dark chocolates|               118|
|   Breakfast cereals|               102|
|             Cheeses|               102|
|     Milk chocolates|               100|
|              Breads|               100|
|             Candies|                91|
|               Wines|                87|
|    Swiss chocolates|                85|
|              Sauces|                85|
|Extra-virgin oliv...|                84|
|          Chocolates|                80|
|               Meals|                79|

The top product category is Sweetened beverages with 356 products.

The 2nd is Swiss Gruyères (a local cheese) with 212 products.

The 3rd place is a tie between Biscuits and Yogurts with 178 products each.

Let's check `categories_en` to see whether it offers a better way to categorize the products.

In [54]:
from pyspark.sql.functions import lower, col
Categories_df = FoodDF.select(col("categories_en").
                           alias("Categories")
                          ).groupBy("Categories"
                                   ).count().select("Categories",
                                                    f.col("count"
                                                         ).alias("Number of Products")) # Group by full category path / count products
Categories_DF = Categories_df.where(col("Categories").isNotNull()) # show no nulls please

Categories_DF = Categories_DF.select(lower(col("Categories")).alias("category"),(col("Number of Products")))
Categories_DF = Categories_DF.distinct()

print("The Number of Products by Categories") # title
Categories_DF.sort(col("Number of Products").desc()).show(500)

The Number of Products by Categories
+--------------------+------------------+
|            category|Number of Products|
+--------------------+------------------+
|dairies,fermented...|               205|
|dairies,fermented...|               173|
|snacks,sweet snac...|               166|
|           beverages|               140|
|snacks,sweet snac...|               121|
|snacks,sweet snac...|               113|
|plant-based foods...|               102|
|dairies,fermented...|               100|
|plant-based foods...|               100|
|snacks,sweet snac...|                98|
|beverages,alcohol...|                87|
|snacks,sweet snac...|                85|
|plant-based foods...|                84|
|    groceries,sauces|                76|
|snacks,sweet snac...|                73|
|               meals|                71|
|groceries,sauces,...|                70|
|snacks,salty snac...|                69|
|beverages,alcohol...|                63|
|spreads,breakfast...|                6
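One thing to keep in mind about the cell above: it counts first and lowercases afterwards. If two category strings differ only in letter case, they keep separate rows with separate counts, and `distinct()` does not merge them. The plain-Python sketch below contrasts the two orders of operations; the category strings and counts are made up for illustration.

```python
from collections import Counter

raw_counts = [                      # made-up (category path, count) pairs
    ("Groceries,Sauces", 40),
    ("groceries,sauces", 5),
    ("Beverages", 140),
]

# Count first, lowercase after, then de-duplicate: the two sauce rows survive
# because their counts differ.
lower_then_distinct = {(name.lower(), n) for name, n in raw_counts}

# Lowercase first, then aggregate: the two sauce rows are merged.
merged = Counter()
for name, n in raw_counts:
    merged[name.lower()] += n

print(sorted(lower_then_distinct))
print(dict(merged))
```

In PySpark the second variant would correspond to grouping on `lower(col("categories_en"))` before calling `count()`.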

In [55]:
categories_list_df = Categories_DF.select(
        "category","Number of Products",
        f.split("category", ",").alias("category"),
        f.posexplode(f.split("category", ",")).alias("pos", "cat")
    )
categories_list_df.show(10)

+--------------------+------------------+--------------------+---+--------------------+
|            category|Number of Products|            category|pos|                 cat|
+--------------------+------------------+--------------------+---+--------------------+
|plant-based foods...|                 1|[plant-based food...|  0|plant-based foods...|
|plant-based foods...|                 1|[plant-based food...|  1|   plant-based foods|
|plant-based foods...|                 1|[plant-based food...|  2|           groceries|
|plant-based foods...|                 1|[plant-based food...|  3|              snacks|
|plant-based foods...|                 1|[plant-based food...|  4|          condiments|
|plant-based foods...|                 1|[plant-based food...|  5|fruits and vegeta...|
|plant-based foods...|                 1|[plant-based food...|  6|        salty snacks|
|plant-based foods...|                 1|[plant-based food...|  7|          appetizers|
|plant-based foods...|          

In [56]:
categories_l = categories_list_df.select("cat","Number of Products").where(
    col("pos")==1) # change value X in <col("pos")==X> to see the different levels of categorization 
grouped_cat = categories_l.select("cat","Number of Products").groupBy("cat").sum("Number of Products")
dstnct_cat = grouped_cat.select("cat","sum(Number of Products)").orderBy(col("sum(Number of Products)").desc())
dstnct_cat.show(50)

+--------------------+-----------------------+
|                 cat|sum(Number of Products)|
+--------------------+-----------------------+
|   plant-based foods|                   4815|
|        sweet snacks|                   2554|
|     fermented foods|                   1684|
|           beverages|                   1197|
|              sauces|                    657|
| alcoholic beverages|                    628|
|      prepared meats|                    413|
|        salty snacks|                    339|
|          breakfasts|                    269|
|        frozen foods|                    263|
|              fishes|                    249|
|            desserts|                    226|
|          condiments|                    214|
|   carbonated drinks|                    196|
|               milks|                    151|
|              waters|                    150|
|               meals|                    145|
|              syrups|                    124|
|            

At the second category level (`pos == 1`), the top three categories are:

1. Plant-based foods with 4,815 products

2. Sweet snacks with 2,554 products

3. Fermented foods (such as cheese) with 1,684 products
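The level-based aggregation can be traced on a few rows in plain Python. The sketch below is an analogue of the split, `posexplode`, and `groupBy` steps, not the notebook's code; `sum_by_level` is a hypothetical helper, and the category paths and counts are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sum_by_level(rows: Iterable[Tuple[str, int]], level: int) -> Dict[str, int]:
    """Sum product counts by the category found at position `level` of each path."""
    totals: Dict[str, int] = defaultdict(int)
    for path, n_products in rows:
        parts = path.split(",")
        if level < len(parts):          # paths shorter than `level` have no entry here
            totals[parts[level]] += n_products
    return dict(totals)


rows = [                                # made-up (category path, count) pairs
    ("dairies,fermented foods,cheeses", 10),
    ("dairies,fermented foods,yogurts", 7),
    ("snacks,sweet snacks,biscuits", 5),
    ("beverages", 3),
]

second_level = sum_by_level(rows, 1)
print(second_level)
```

Products whose path has only one level, such as `beverages` in the sample, do not appear at level 1 at all, so totals at different levels are not directly comparable.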

#### List containing names of products by category

Let's look at some products in the top three food categories (using `main_category_en`):

First, **Sweetened beverages**:

In [48]:
SweetBev_DF = FoodDF.select("generic_name", "product_name"
                        ).where((col("main_category_en")=="Sweetened beverages"))

SweetBevDF = SweetBev_DF.where(col("generic_name").isNotNull() & col("product_name").isNotNull()).distinct()
print("               Sweet Beverages:")
SweetBevDF.show(50)

               Sweet Beverages:
+--------------------+--------------------+
|        generic_name|        product_name|
+--------------------+--------------------+
|Boisson rafraîchi...|    Volvic Thé Pêche|
|Boisson rafraîchi...|Coca-Cola Goût Or...|
|Boisson rafraichi...|Thé glacé saveur ...|
|Koffeinhaltige Li...|           Coca-Cola|
|Boisson rafraîchi...|            Té limón|
|Boisson gazeuse l...|Fanta, zero, soda...|
|Boisson rafraîchi...|Pineapple Juice D...|
|La Paille Magique...|Magic Sipper choc...|
|Boisson à l'eau m...|  Volvic zest citron|
|Boisson édulcorée...|Red Bull Energy D...|
|Boisson à l'eau m...|Volvic Juicy agru...|
|Boisson rafraîchi...|              Orange|
|Boisson de table ...|               Pomme|
|Boisson gazeuse r...|               Pepsi|
|Boisson rafraichi...|              Orange|
|Thé glacé (boisso...|Nestea  Mangue An...|
|Préparation en po...|             Nesquik|
|Infusion Instanta...|   Tisane bonne nuit|
|Boisson concentré...|Boisson concentré...|


Above are some of the beverages in our top category.

Now, let's look at the **Swiss Gruyères** category:

In [49]:
Cheese_DF = FoodDF.select("product_name"
                        ).where((col("main_category_en")=="Swiss Gruyères"))

CheeseDF = Cheese_DF.where(col("product_name").isNotNull()).distinct()
print("     Swiss Cheese:")
CheeseDF.show(50)

     Swiss Cheese:
+--------------------+
|        product_name|
+--------------------+
| Le gruyère Surchoix|
| Le Gruyère Surchoix|
|Le gruyère Switze...|
|     Le Gruyère salé|
|Le Gruyère Switze...|
|Gruyère AOP Bio S...|
|  Le Gruyère Rebibes|
|    Gruyère AOP Doux|
|      Gruyère jamadu|
|    Gruyère surchoix|
|Le Gruyère AOP Ka...|
|Le Gruyère  AOP salé|
|Le Gruyère switze...|
|     Fromage Gruyère|
|Le Gruyère AOP SU...|
|  Le Gruyère Mi-salé|
|Schweizer Hartkäs...|
|Gruyère corsé 12 ...|
|Le gruyère switze...|
|KALTBACH LE GRUYÉ...|
|Le Gruyère Switze...|
|          LE GRUYÈRE|
|Gruyère d'alpage AOP|
|        Gruyère Doux|
|          Höhlengold|
|Gruyère AOP surchoix|
|AOP LE GRUYERE SW...|
| Le Gruyère Reibkäse|
|        Gruyère râpé|
|          Le gruyère|
|Qualité & Prix Le...|
|Nussbrot le gruye...|
|Le Gruyère Switze...|
|         Gruyere AOP|
|Le Gruyère, AOP, ...|
|     Gruyère Mi-Salé|
|Qualité & Prix Le...|
|      Gruyère suisse|
|Le Gruyère  &quot...|
|             g

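Above we can see the distinct product names in the Swiss Gruyères category. Many entries are spelling or capitalization variants of the same product (for example "Le gruyère Surchoix" and "Le Gruyère Surchoix"), so the number of names overstates the number of different cheeses.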

How about those **Biscuits**:

In [50]:
Bisc_DF = FoodDF.select("product_name"
                        ).where((col("main_category_en")=="Biscuits"))
#                            alias("Main Category")

BiscDF = Bisc_DF.where(col("product_name").isNotNull()).distinct()
print("        Biscuits:")
BiscDF.show(50)

        Biscuits:
+--------------------+
|        product_name|
+--------------------+
|Passione italiana...|
|  Biscuits Prussiens|
|Coop Bio Cookies ...|
|            Biscotti|
|          Pepparkaka|
|           Speculoos|
|      Biscuit Sésame|
|  Biscuits au beurre|
|BelVita Miel et P...|
|Biscuit lait choc...|
|Petit beurre fram...|
|Belvita figues et...|
|  Belvita  Breakfast|
|           Digestive|
|  Biscuits BUTTERFLY|
|       Maria bolacha|
|            Japonais|
|Biscuits orange a...|
|  Savoiardi al cacao|
|Fórmula 1  alimen...|
|   Paste di mandorla|
|Abbracci fin caca...|
|Bonomi Italian Am...|
|Pure butter short...|
|Danesita Butter C...|
|Dar-vida break ap...|
|      Zitronenherzli|
|           Cooky N01|
|Luxury Cookies Do...|
|        Snac cracker|
|   Biscuits complets|
|Gut & Günstig coo...|
|Dar vida choco au...|
|Galletas espelta ...|
|             Biscoff|
|       Blévita curry|
|Biscuit complet a...|
|Savoiardi Laydfin...|
| Gran Cereale Frutta|
|   4 Biscuits S

Biscuits for all taste buds!

Finally, something healthy, **Yogurts**:

In [None]:
Yog_DF = FoodDF.select("product_name"
                        ).where((col("main_category_en")=="Yogurts"))
#                            alias("Main Category")

YogDF = Yog_DF.where(col("product_name").isNotNull()).distinct()
print("        Yogurts:")
YogDF.show(50)

178 types of yogurt, a bit boring.

Let's take a look at some products inside the top 3 food categories (using categories_en):

First, of course, let's take a look at **Plant Based**:

In [63]:
Plant_DF = FoodDF.select("product_name"
                        ).where(col("categories_en").contains("plant-based foods"))
#                            alias("Main Category")

Plant_df = Plant_DF.where(col("product_name").isNotNull()).distinct()
print("  Plant Based Foods:")
Plant_df.show(50)

  Plant Based Foods:
+--------------------+
|        product_name|
+--------------------+
|           Mischobst|
|                Grão|
|Morceaux de pomme...|
|Raisins sultanine...|
|        Ananasstücke|
|Tomates séchées d...|
|   Tranches d'ananas|
|Baby Ananas tranches|
|     pomodori secchi|
|Petits croquants ...|
|    Cœurs De Palmier|
| Macédoine de fruits|
|             Rotkohl|
| Mélange Fruits Secs|
|Champignons mélan...|
|               Ajvar|
|Cassegrain Petit ...|
|    Mini épi de maïs|
|          Gemüsemais|
|    Griottes séchées|
|       Romarin séché|
|Golden Sweet Anan...|
|          Spreelinge|
|Champignons blanc...|
|Champignons au Vi...|
| Morceaux de mangues|
|Brocolis suisse, ...|
|  Maïs doux croquant|
|Légumes pour Cous...|
|         Fruits secs|
|       Aprikosen 161|
|   Baies de goji bio|
|       Bio Soy Beans|
|Giardiniera crocante|
|Haricots verts ex...|
|  Bouillon de légume|
|      Haricots Verts|
|Haricots Beurre E...|
|Rewe Best Wahl Po...|
|           F
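
One caveat on the filter above: `Column.contains` is a case-sensitive substring match, so the lowercase pattern `"plant-based foods"` only matches rows where that exact lowercase text occurs somewhere in `categories_en`. The plain-Python sketch below mirrors that behaviour with the `in` operator. The category strings are made up for illustration and are not rows from the dataset.

```python
# Hypothetical category strings (illustrative only, not taken from the dataset).
categories = [
    "Plant-based foods and beverages,Plant-based foods,Cereals and potatoes",
    "Plant-based foods and beverages,Canned plant-based foods",
    "Dairies,Fermented foods,Yogurts",
]

pattern = "plant-based foods"

# Case-sensitive substring test, analogous to col("categories_en").contains(pattern).
case_sensitive = [c for c in categories if pattern in c]

# Case-insensitive variant: lowercase the text before testing.
case_insensitive = [c for c in categories if pattern in c.lower()]

print(len(case_sensitive), "match(es) when case matters")
print(len(case_insensitive), "match(es) when case is ignored")
```

In PySpark the case-insensitive version would be `lower(col("categories_en")).contains("plant-based foods")`. The same consideration applies to the `"sweet"` and `"fermented"` filters that follow.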

In [68]:
Sweets_DF = FoodDF.select("product_name"
                        ).where(col("categories_en").contains("sweet"))
#                            alias("Main Category")

Sweets_df = Sweets_DF.where(col("product_name").isNotNull()).distinct()
print("     Sweet Snacks:")
Sweets_df.show(50)

     Sweet Snacks:
+--------------------+
|        product_name|
+--------------------+
|            Montcalm|
|     Verde Grean Tea|
|Citro FARMER, Zit...|
|     Raspberry Falls|
|Herbal Infusion C...|
|Petits croquants ...|
|Soja original sav...|
|Kinder Weihnachts...|
|  Japanese Green Tea|
|       Teisseire Max|
|          Gemüsemais|
|Jus Cranberry Lig...|
|Capsules de café ...|
|Refresco Pepsi La...|
|      Pepsi Cola Max|
|           Kokosnuss|
|Oatly Hafer Avena...|
|         La Salvetat|
|     Vinaigre de riz|
|  Maïs doux croquant|
|      Eau Cristaline|
|    Schweppes Agrum'|
|Green smoothie sp...|
|Black ice tea pea...|
|            Red bull|
|         Eau Perrier|
|   Espresso Classico|
| Schoko zimt mandeln|
|Mandelmilch ohne ...|
|   Dada saveur pêche|
|  Chococru Gingembre|
|Betteraves rouges...|
|     Happy Cola Zéro|
| Thé Orange Cannelle|
|Vollmilch ohne Zu...|
|           Pepsi Max|
| jus de citron light|
|     Ice Tea : Lemon|
|Coca-Cola Zéro Su...|
|        Lipton

In [70]:
Ferm_DF = FoodDF.select("product_name"
                        ).where(col("categories_en").contains("fermented"))
#                            alias("Main Category")

Ferm_df = Ferm_DF.where(col("product_name").isNotNull()).distinct()
print("    Fermented Foods:")
Ferm_df.show(50)

    Fermented Foods:
+--------------------+
|        product_name|
+--------------------+
|Dessert pulpe de ...|
|Bio Sojo Granatapfel|
|So soya ! Pêche -...|
|Jocos mango-passi...|
|Mango plant-based...|
|             So Soya|
|Alpro Vaniglia/ B...|
| Sojasun nature coco|
|          Alpro coco|
|Spécialité au soj...|
|  Nature aux Amandes|
|       Yaourt Mangue|
|  Nature sans sucres|
|Gourmand et végét...|
|Spécialité au soj...|
|Natürliche Bifind...|
|Harvest Moon Coco...|
|Postre de Soja Na...|
|   SOJA Soyog Fraise|
|Postre de soja co...|
|   Dessert Myrtilles|
|  Skyr Style Vanille|
|Kids - Fraise - P...|
|Yaourt aux fruits...|
|Bifidus Natural s...|
|Spécialité au soj...|
|     Sojasun Bifidus|
|               Jocos|
|Gourmand et Végét...|
|Made From Soya Na...|
+--------------------+



### 6. Identify Traces and Compute:

Similar to the approach with analyzing categories, we will see how the traces columns look before proceeding.

In [71]:
FoodTrac_DF = FoodDF.select("product_name","traces","traces_tags","traces_en","main_category_en").where(
                                                                      (col("traces").isNotNull()) & 
                                                                      (col("traces_tags").isNotNull()) & 
                                                                      (col("traces_en").isNotNull()))
FoodTrac_DF.show(50)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|        product_name|              traces|         traces_tags|           traces_en|    main_category_en|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|            smelties|           fr:aucune|           fr:aucune|           fr:aucune|          Baby foods|
|stylo glaçage fuc...|           en:gluten|           en:gluten|              Gluten|fr:decoration-des...|
|             Almonds|             en:nuts|             en:nuts|                Nuts|             Almonds|
|  Mélange  randonnée|  en:nuts,en:peanuts|  en:nuts,en:peanuts|        Nuts,Peanuts|                null|
|Crunchy Canadian ...|en:nuts,en:peanut...|en:nuts,en:peanut...|Nuts,Peanuts,Soyb...|         Cereal bars|
|Crunchy Oats & Ho...|en:nuts,en:peanut...|en:nuts,en:peanut...|Nuts,Peanuts,Soyb...|         Cereal bars|
|dar-vida extra fi...|en:nuts,en:sesa

We also count the nulls in each traces column, to make a more informed decision when choosing the column to analyze.

In [72]:
FoodDF.select([count(when(col('traces').isNull(), True)).alias("traces nulls")]).show()
FoodDF.select([count(when(col('traces_tags').isNull(), True)).alias("traces_tags nulls")]).show()
FoodDF.select([count(when(col('traces_en').isNull(), True)).alias("traces_en nulls")]).show()

+------------+
|traces nulls|
+------------+
|       47584|
+------------+

+-----------------+
|traces_tags nulls|
+-----------------+
|            46270|
+-----------------+

+---------------+
|traces_en nulls|
+---------------+
|          46270|
+---------------+



Wow, that is a lot of nulls: over 90% of the products have no value in any of the traces columns. Hopefully this means most products have no traces, but a null may simply mean the information was never entered. Let's see what we can extract from here.

The best column to choose is traces_en, because it has the fewest nulls (tied with traces_tags) and gives us the trace in English.

We will have to delete the "aucune" observations, as these tell us (in French) that the product has no traces.

First, let's see how many products actually have traces:

In [73]:
count_auc = FoodTrac_DF.select("traces_en").where(col("traces_en").contains("aucune")).count()
print("only " + str(count_auc) + " products have 'none' in french")

only 14 products have 'none' in french


In [74]:
count_fr = FoodTrac_DF.select("traces_en").where(~col("traces_en").contains("aucune")).count()
print("Whoops! " + str(count_fr) + " out of 50900 have traces, and the other 46270 are null, what about the other thousand or so products?")

Whoops! 3365 out of 50900 have traces, and the other 46270 are null, what about the other thousand or so products?
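
The gap can be reconciled from counts already printed in this notebook. `FoodTrac_DF` keeps only rows where all three traces columns are non-null, so it cannot be larger than the sparsest column (`traces`). Here is a quick arithmetic check in plain Python, using the row count from the EDA section and the null counts above. It assumes every row with a non-null `traces` also has `traces_tags` and `traces_en` filled, which is consistent with the printed numbers.

```python
total_rows = 50963        # FoodDF.count() reported in the EDA section
traces_nulls = 47584      # nulls in `traces`
traces_en_nulls = 46270   # nulls in `traces_en`
aucune_rows = 14          # rows whose traces_en contains "aucune"

# Assumption: a non-null `traces` implies non-null traces_tags and traces_en.
rows_in_foodtrac = total_rows - traces_nulls             # rows kept by the 3-column filter
rows_without_aucune = rows_in_foodtrac - aucune_rows     # should match the 3365 printed above
rows_with_traces_en = total_rows - traces_en_nulls       # rows with any traces_en value
dropped_by_filter = rows_with_traces_en - rows_in_foodtrac

print("Rows in FoodTrac_DF:", rows_in_foodtrac)
print("Rows without 'aucune':", rows_without_aucune)
print("Rows with traces_en but null traces:", dropped_by_filter)
```

Under that assumption, the "other thousand or so" are the 1314 products that have `traces_en` filled but a null `traces`: they were excluded when `FoodTrac_DF` required all three columns to be non-null.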


In [75]:
count_kei = FoodTrac_DF.select("traces_en").where(col("traces_en").contains("keine")).count()
print(str(count_kei) + " products have 'none' in german")

0 products have 'none' in german


In [76]:
count_en = FoodTrac_DF.select("traces_en").where(col("traces_en").contains("fr:non")).count()
print(str(count_en) + " out of 50900 contain non")
FoodTrac_DF.select("traces_en").where(col("traces_en").contains("non")).show()

2 out of 50900 contain non
+--------------------+
|           traces_en|
+--------------------+
|Milk,fr:noix-non-...|
|              fr:non|
|              fr:non|
+--------------------+



It seems the only marker that matters here is "aucune"; the others can stay, as their number is insignificant.

In [77]:
FoodTracDF = FoodTrac_DF.select("product_name","traces_en","main_category_en").where(~col("traces_en").contains("aucune"))
FoodTracDF.show(10)

+--------------------+--------------------+--------------------+
|        product_name|           traces_en|    main_category_en|
+--------------------+--------------------+--------------------+
|stylo glaçage fuc...|              Gluten|fr:decoration-des...|
|             Almonds|                Nuts|             Almonds|
|  Mélange  randonnée|        Nuts,Peanuts|                null|
|Crunchy Canadian ...|Nuts,Peanuts,Soyb...|         Cereal bars|
|Crunchy Oats & Ho...|Nuts,Peanuts,Soyb...|         Cereal bars|
|dar-vida extra fi...|   Nuts,Sesame seeds|        Salty snacks|
|Jolly time, popco...|       Eggs,Soybeans|             Popcorn|
|      ressens spread|                Nuts| fr:Pâtes à tartiner|
|Shortbread Triangles|                Nuts|          Shortbread|
|Pure Butter Short...|                Nuts|          Shortbread|
+--------------------+--------------------+--------------------+
only showing top 10 rows



Ok, we got rid of aucune. 

But now we need to separate the traces to analyze the unique ones:

In [78]:
FoodTraceDF = FoodTracDF.select(
        "traces_en","product_name","main_category_en",
        f.split("traces_en", ",").alias("traces"),
        f.posexplode(f.split("traces_en", ",")).alias("pos", "trace")
    )
FoodTraceDF.show(5)

+--------------------+--------------------+--------------------+--------------------+---+-------+
|           traces_en|        product_name|    main_category_en|              traces|pos|  trace|
+--------------------+--------------------+--------------------+--------------------+---+-------+
|              Gluten|stylo glaçage fuc...|fr:decoration-des...|            [Gluten]|  0| Gluten|
|                Nuts|             Almonds|             Almonds|              [Nuts]|  0|   Nuts|
|        Nuts,Peanuts|  Mélange  randonnée|                null|     [Nuts, Peanuts]|  0|   Nuts|
|        Nuts,Peanuts|  Mélange  randonnée|                null|     [Nuts, Peanuts]|  1|Peanuts|
|Nuts,Peanuts,Soyb...|Crunchy Canadian ...|         Cereal bars|[Nuts, Peanuts, S...|  0|   Nuts|
+--------------------+--------------------+--------------------+--------------------+---+-------+
only showing top 5 rows
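
For readers new to `posexplode`, here is what the split-and-explode step does, sketched in plain Python rather than Spark: each comma-separated `traces_en` string becomes one row per trace, together with its position in the list. The two sample rows are copied from the output above; the helper name `posexplode_traces` is made up for this sketch.

```python
def posexplode_traces(traces_en):
    """Return (pos, trace) pairs for one comma-separated traces string."""
    return list(enumerate(traces_en.split(",")))

# Two (product_name, traces_en) rows taken from the output above.
rows = [
    ("Almonds", "Nuts"),
    ("Mélange  randonnée", "Nuts,Peanuts"),
]

# One output row per (product, trace) pair, like the exploded DataFrame.
exploded = [
    (name, pos, trace)
    for name, traces_en in rows
    for pos, trace in posexplode_traces(traces_en)
]

for row in exploded:
    print(row)
```

A product with two traces therefore appears twice, which is why the later counts are counts of (product, trace) pairs rather than of products.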



Now we are ready to analyze the traces.

#### Number of Products by trace

How many unique traces are there?

In [79]:
trace_count = FoodTraceDF.select("trace").distinct().count()
print("There are " + str(trace_count) + " unique traces")

There are 321 unique traces


In [80]:
Traces_List = FoodTraceDF.select(col("trace").
                           alias("Trace")
                          ).groupBy("Trace"
                                   ).count().select("Trace",
                                                    f.col("count"
                                                         ).alias("Number of Products")) # Group by Traces / count products
Traces_DF = Traces_List.where(col("Trace").isNotNull()) # show no nulls please

print("The Number of Products by Trace:") # title
Traces_DF.sort(col("Number of Products").desc()).show(10)

The Number of Products by Trace:
+------------+------------------+
|       Trace|Number of Products|
+------------+------------------+
|        Nuts|              1955|
|        Milk|              1006|
|    Soybeans|               999|
|        Eggs|               782|
|Sesame seeds|               754|
|      Gluten|               643|
|     Peanuts|               449|
|      Celery|               277|
|     Mustard|               259|
|       Lupin|               134|
+------------+------------------+
only showing top 10 rows
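
The groupBy/count above is a frequency table over the exploded trace values. With 321 distinct values, some are probably variants of the same trace (untranslated `fr:` tags, stray whitespace, different capitalization). The plain-Python sketch below shows how a small normalization step changes the counts. The input list is hypothetical and only illustrates the idea.

```python
from collections import Counter

# Hypothetical exploded trace values; the variants are made up for illustration.
traces = ["Nuts", "Milk", " nuts", "Nuts", "fr:non", "Soybeans", "milk "]

def normalise(trace):
    """Trim whitespace and lowercase so spelling variants share one key."""
    return trace.strip().lower()

raw_counts = Counter(traces)                            # counts exact strings
clean_counts = Counter(normalise(t) for t in traces)    # counts normalized strings

print("Distinct raw values:", len(raw_counts))
print("Distinct normalized values:", len(clean_counts))
print("Products with nut traces after normalization:", clean_counts["nuts"])
```

Whether such variants actually occur in `traces_en` would need to be checked against the data before applying any cleanup.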



The top trace is Nuts by far. 

The 2nd most common trace is Milk.

The 3rd most common trace is Soybeans. 

Let's further analyze those three.

#### List containing names of products by trace

Here we will take a look at the top 3 traces: Nuts, Milk, and Soybeans.

The top one is **Nuts**, which many people are allergic to:

In [81]:
Nuts_DF = FoodTraceDF.select("product_name","main_category_en"
                        ).where((col("Trace")=="Nuts"))

NutsDF = Nuts_DF.where(col("generic_name").isNotNull() & col("product_name").isNotNull()).distinct()
print("          Products containing Nuts:")
NutsDF.show(50)

          Products containing Nuts:
+--------------------+--------------------+
|        product_name|    main_category_en|
+--------------------+--------------------+
|Délice d'Ail aux ...|      Salted spreads|
|       Savane jungle|               Cakes|
|  Choco Petit Beurre|es:Galletas con t...|
|    Chocolat au Lait|     Milk chocolates|
|Chocolat au lait ...|     Milk chocolates|
|Madeleine pur beurre|          Madeleines|
|         Rotes Pesto|                Food|
|Croustrifondante ...|      Stuffed wafers|
|       Biscuits bébé| Biscuits for babies|
|     Roggen Vollkorn|        Brown breads|
|    Lait & Noisettes|Milk chocolate wi...|
|Œuf Chocolat Amandes|fr:oeufs-en-chocolat|
|Veritable petit b...|               Cakes|
|Brioche tranchée ...|fr:Brioches tranc...|
|       Beeren Crunch| Mueslis with fruits|
|       Le Crissirois|              Breads|
|SUPRÊME LAIT TROI...|    Swiss chocolates|
|Petit beurre tabl...|Petits-beurres-wi...|
|        Oriental Mix|    fr:snacks-sale

It seems these are mainly biscuits, chocolate, pastries and candy. 

Let's take a look at **Milk** now:

In [82]:
Milk_DF = FoodTraceDF.select("product_name","main_category_en"
                        ).where((col("Trace")=="Milk"))

MilkDF = Milk_DF.where(col("generic_name").isNotNull() & col("product_name").isNotNull()).distinct()
print("          Products containing Milk:")
MilkDF.show(50)

          Products containing Milk:
+--------------------+--------------------+
|        product_name|    main_category_en|
+--------------------+--------------------+
|       Savane jungle|               Cakes|
|    Galettes de maïs|   Puffed corn cakes|
|Gélatine saveur A...|Mixes for jelly d...|
|     Roggen Vollkorn|        Brown breads|
|     Jogurt Waldbeer|       Fruit yogurts|
|       Le Crissirois|              Breads|
|     Tortilla strips|          Corn chips|
|Les Toasts pour S...|       Sliced breads|
|   Kakao Doppel Keks|            Biscuits|
|Magnum Bâtonnet G...|Chocolate ice cre...|
|   Sarments du Médoc|     Dark chocolates|
|Sablé saveur Citr...|fr:Biscuits édulc...|
|Vollkorn Harmonie...|       Sliced breads|
|Pain du montagnar...|      Special breads|
|      Schoko Bananen|     Confectioneries|
|Fitness nutritiou...|   Breakfast cereals|
|Carrefour tuiles ...|    fr:tuiles-salees|
|        TUC Original|            Starters|
|Penne all'arrabbi...|     Microwave mea

With milk traces we also find cakes, biscuits, cereals, ice cream and chocolate, plus a few savoury foods such as lasagne and pâté.

Finally, let's check out products with **Soybean** traces:

In [83]:
Soy_DF = FoodTraceDF.select("product_name","main_category_en"
                        ).where((col("Trace")=="Soybeans"))

SoyDF = Soy_DF.where(col("generic_name").isNotNull() & col("product_name").isNotNull()).distinct()
print("         Products containing Soybeans:")
SoyDF.show(50)

         Products containing Soybeans:
+--------------------+--------------------+
|        product_name|    main_category_en|
+--------------------+--------------------+
|       Savane jungle|               Cakes|
|    Galettes de maïs|   Puffed corn cakes|
|Madeleine pur beurre|          Madeleines|
|       Biscuits bébé| Biscuits for babies|
|     Roggen Vollkorn|        Brown breads|
|Brioche tranchée ...|fr:Brioches tranc...|
|       Beeren Crunch| Mueslis with fruits|
|       Le Crissirois|              Breads|
|   Kakao Doppel Keks|            Biscuits|
|Crunchy Oats & Ho...|         Cereal bars|
|Magnum Bâtonnet G...|Chocolate ice cre...|
|   Sarments du Médoc|     Dark chocolates|
|Sablé saveur Citr...|fr:Biscuits édulc...|
|Vollkorn Harmonie...|       Sliced breads|
|Collezione Lasagn...|                Food|
|        TUC Original|            Starters|
|   Pâtes Tagliatelle|         Tagliatelle|
|Madeleines Pur Be...|          Madeleines|
|Le Cake à l'Anglaise|         Fruit 

Again, we find many pastries, biscuits, chocolate, and breads with Soybean traces.

We can observe that many of the products carrying these common allergen traces are ones usually considered not very healthy, such as pastries, biscuits and chocolate.

### 7. Data Quality Analysis

The following fields are considered fields of interest:

**Source: Raúl Marín**

In [84]:
IntFoodDF = FoodDF.select("creator","created_datetime","last_modified_datetime","product_name","countries_en","traces_en",
              "additives_tags","main_category_en","image_url","quantity","packaging_tags","categories_en",
              "ingredients_text","additives_en","energy-kcal_100g","fat_100g","saturated-fat_100g","sugars_100g",
              "salt_100g","proteins_100g"
             )

#### 7.1 Number of Products with complete info:

In [85]:
from functools import reduce

# IntFoodDF already holds exactly the fields of interest,
# so a product is "complete" when every one of its columns is non-null.
all_fields_present = reduce(lambda a, b: a & b,
                            [col(c).isNotNull() for c in IntFoodDF.columns])
products_complete = IntFoodDF.where(all_fields_present)
products_complete_count = products_complete.count()
print("There are only " + str(products_complete_count) + " products with fields of interest complete :(")

There are only 948 products with fields of interest complete :(


In [86]:
Full_count = FoodDF.count()
Percent_complete = products_complete_count / Full_count *100
print("Only " + str(Percent_complete) + " % of the products have the fields of interest complete")

Only 1.86017306673469 % of the products have the fields of interest complete


Only 948 products have all of these fields of interest complete, which is less than 2% of the data (about 1.86%).
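
Sections 7.1 to 7.4 all compute the same kind of figure: the share of products for which a given set of fields is filled in. Here is a minimal plain-Python sketch of that calculation. The helper `percent_complete` and the four sample records are hypothetical and only stand in for `IntFoodDF`; the last line re-derives the notebook's own figure from the two counts printed above.

```python
def percent_complete(records, fields):
    """Return the percentage of records where every listed field is not None."""
    if not records:
        return 0.0
    complete = sum(all(r.get(f) is not None for f in fields) for r in records)
    return complete / len(records) * 100

# Hypothetical mini-sample standing in for IntFoodDF rows.
sample = [
    {"product_name": "A", "fat_100g": 1.5, "salt_100g": 0.3},
    {"product_name": "B", "fat_100g": None, "salt_100g": 1.0},
    {"product_name": None, "fat_100g": 3.0, "salt_100g": 0.1},
    {"product_name": "D", "fat_100g": 0.9, "salt_100g": 0.0},
]

pct_all = percent_complete(sample, ["product_name", "fat_100g", "salt_100g"])
pct_fat = percent_complete(sample, ["fat_100g"])

# The notebook's own figure: 948 complete rows out of 50963.
notebook_pct = round(948 / 50963 * 100, 2)

print(pct_all, pct_fat, notebook_pct)
```

Note that a value of `0.0` counts as present: only a missing value (`None`, or null in Spark) makes a record incomplete.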

#### 7.2 The % of products without complete analysis per 100g

In [87]:
products_with_100g = IntFoodDF.select("energy-kcal_100g","fat_100g","saturated-fat_100g","sugars_100g","salt_100g",
                                      "proteins_100g"
                                    ).where((col("energy-kcal_100g").isNotNull()) &
                                            (col("fat_100g").isNotNull()) &
                                            (col("saturated-fat_100g").isNotNull()) &
                                            (col("sugars_100g").isNotNull()) &
                                            (col("proteins_100g").isNotNull()) &
                                            (col("salt_100g").isNotNull())).count()
print("There are " + str(products_with_100g) + " products with a complete analysis per 100g")

There are 28448 products with a complete analysis per 100g


In [88]:
Percent_with_100g = products_with_100g / Full_count * 100
print(str(Percent_with_100g) + "% of the products have a complete 100g analysis")

55.82088966505112% of the products have a complete 100g analysis


In [89]:
Percent_wout_100g = 100 - Percent_with_100g
print(str(Percent_wout_100g) + "% of the products are without a complete 100g analysis")

44.17911033494888% of the products are without a complete 100g analysis


About **44.18%** of the products do not have a complete analysis per 100g!

#### 7.3 The % of products without additives info

In [90]:
products_with_additives = IntFoodDF.select("additives_en"
                                     ).where(col("additives_en").isNotNull()).count()
print("There are " + str(products_with_additives) + " products with information about additives")

There are 11800 products with information about additives


In [91]:
Percent_with_additives = products_with_additives / Full_count * 100
print(str(Percent_with_additives) + "% of the products have information about additives")

23.1540529403685% of the products have information about additives


In [92]:
Percent_wout_additives = 100 - Percent_with_additives
print(str(Percent_wout_additives) + "% of the products do not have information about additives")

76.8459470596315% of the products do not have information about additives


Around **76.85%** of the products do not have information about additives. This is a concern!

#### 7.4 The % of products without traces info

In [93]:
products_with_traces = IntFoodDF.select("traces_en"
                                     ).where(col("traces_en").isNotNull()).count()
print("There are " + str(products_with_traces) + " products with information about traces")

There are 4693 products with information about traces


In [94]:
Percent_with_traces = products_with_traces / Full_count * 100
print(str(Percent_with_traces) + "% of the products have information about traces")

9.208641563487236% of the products have information about traces


In [95]:
Percent_wout_traces = 100 - Percent_with_traces
print(str(Percent_wout_traces) + "% of the products do not have information about traces")

90.79135843651277% of the products do not have information about traces


Around **90.79%** of the products do not have information about traces. This is even more of a concern, as many people have food allergies.

### 8. Data Profiling on fields of interest

In [96]:
per_100g_DF = IntFoodDF.select("energy-kcal_100g","fat_100g","saturated-fat_100g","sugars_100g","salt_100g", 
                               "proteins_100g")
per_100g_DF.show(10)

+----------------+---------------+------------------+----------------+----------------+-------------+
|energy-kcal_100g|       fat_100g|saturated-fat_100g|     sugars_100g|       salt_100g|proteins_100g|
+----------------+---------------+------------------+----------------+----------------+-------------+
|            null|           null|              null|            null|            null|         null|
|            null|           null|              null|            null|            null|         null|
|            81.0|            0.9|               0.1|             0.1|             0.3|         18.3|
|           357.0|            3.0|              null|            null|            null|          8.0|
|           366.0|2.2999999523163|               0.5| 1.7999999523163|             1.0|         72.0|
|            null|           null|              null|            null|            null|         null|
|            97.0|            1.5|               0.7|             0.0|            

#### Stats on analysis per 100g fields

In [97]:
per_100g_DF.describe().show()

+-------+------------------+------------------+------------------+------------------+------------------+-----------------+
|summary|  energy-kcal_100g|          fat_100g|saturated-fat_100g|       sugars_100g|         salt_100g|    proteins_100g|
+-------+------------------+------------------+------------------+------------------+------------------+-----------------+
|  count|             29730|             39655|             38821|             38903|             38422|            39623|
|   mean|270.39247436250906|13.422931571077639| 5.082555317184795|13.303984116638377| 1.419919764111659|8.590529284569868|
| stddev|197.51975028162727|16.994197782579807| 8.003617833349761|19.048782450241227|19.541143107168136|  9.6387247901509|
|    min|               0.0|               0.0|               0.0|               0.0|               0.0|              0.0|
|    max|            2590.0|             112.0|             220.0|             114.0|            2590.0|            190.0|
+-------+-------

The lower availability of the energy-kcal field seems to be what weighs the per 100g analysis down: it has between about 8,700 and 9,900 fewer non-null rows than the other five fields.

Nevertheless, these statistics give a first picture of overall product health in Switzerland, at least for the products that carry nutrition labelling.
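
The `max` row also hints at data-entry errors: a product cannot contain more than 100 g of fat, sugar, salt or protein per 100 g, yet the reported maxima are 112, 220, 114, 2590 and 190. Below is a minimal plain-Python sketch of a range check that would flag such rows before computing health metrics. The gram limits follow from the unit itself. The 900 kcal limit is an assumption (roughly the energy of pure fat), and the helper name `implausible_fields` is made up for this sketch.

```python
# Upper bounds per 100 g. The gram limits follow from the unit;
# the 900 kcal limit is an assumption (roughly the energy of pure fat).
UPPER_BOUNDS = {
    "energy-kcal_100g": 900.0,
    "fat_100g": 100.0,
    "saturated-fat_100g": 100.0,
    "sugars_100g": 100.0,
    "salt_100g": 100.0,
    "proteins_100g": 100.0,
}

def implausible_fields(row):
    """Return the names of fields whose value is negative or above its bound."""
    return [
        name
        for name, bound in UPPER_BOUNDS.items()
        if row.get(name) is not None and not (0.0 <= row[name] <= bound)
    ]

# Maxima reported by describe() above, treated here as one synthetic row.
maxima = {
    "energy-kcal_100g": 2590.0, "fat_100g": 112.0, "saturated-fat_100g": 220.0,
    "sugars_100g": 114.0, "salt_100g": 2590.0, "proteins_100g": 190.0,
}
# A plausible row copied from the per_100g_DF sample output above.
plausible = {
    "energy-kcal_100g": 81.0, "fat_100g": 0.9, "saturated-fat_100g": 0.1,
    "sugars_100g": 0.1, "salt_100g": 0.3, "proteins_100g": 18.3,
}

print(implausible_fields(maxima))
print(implausible_fields(plausible))
```

Missing values are skipped rather than flagged, so this check is independent of the null analysis in section 7.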

## V. Product Health Metrics

First, let's check what "healthy food" actually means according to the NHS: https://www.nhs.uk/live-well/eat-well/how-to-read-food-labels/?tabname=digestive-health

The NHS says the following:

"For a balanced diet:

- Eat at least 5 portions of a variety of fruit and vegetables every day 

- Base meals on potatoes, bread, rice, pasta or other starchy carbohydrates – choose wholegrain or higher fibre where possible

- Have some dairy or dairy alternatives, such as soya drinks and yoghurts – choose lower-fat and lower-sugar options 

- Eat some beans, pulses, fish, eggs, meat and other protein – aim for 2 portions of fish every week, 1 of which should be oily, such as salmon or mackerel

- Choose unsaturated oils and spreads, and eat them in small amounts

- Drink plenty of fluids – the government recommends 6 to 8 cups or glasses a day

If you're having foods and drinks that are high in fat, salt and sugar, have these less often and in small amounts.

Try to choose a variety of different foods from the 4 main food groups." - NHS

So in summary **"Healthy"** means a balance between the 4 food groups:
1. Fruits & Vegetables (G1)
2. Carbohydrates (G2)
3. Low-fat Dairy (G3)
4. Low-fat Protein (G4)

Also, lots of **fluids**! Let's call it our 5th group.

5. Fluids (G5)

And finally, all of these groups must be low in **fat, salt** and **sugar**!

### Food Category Distribution

Let's go back to our categories, assign each one to one of these 5 food groups, and see how they are distributed.
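
To make the idea concrete, here is one possible way to assign a `categories_en` string to a group, sketched in plain Python with simple keyword matching. The keyword lists are assumptions chosen for illustration, not the mapping used in the rest of this notebook, and the function name `food_group` is hypothetical.

```python
# Hypothetical keyword map for groups G1 to G5; the keywords are assumptions.
GROUP_KEYWORDS = {
    "G1": ("fruit", "vegetable"),                          # Fruits & Vegetables
    "G2": ("bread", "pasta", "cereal", "rice", "potato"),  # Carbohydrates
    "G3": ("dairies", "milk", "yogurt", "cheese"),         # Dairy
    "G4": ("meat", "fish", "egg", "legume"),               # Protein
    "G5": ("beverage", "water", "juice"),                  # Fluids
}

def food_group(categories_en):
    """Return the first group whose keyword occurs in the lowercased category string."""
    text = categories_en.lower()
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return group
    return "Other"

examples = [
    "Dairies,Fermented foods,Yogurts",
    "Beverages,Alcoholic beverages",
    "Snacks,Sweet snacks",
]
for categories in examples:
    print(categories, "->", food_group(categories))
```

Substring matching like this is crude: the first matching group wins, so a string mentioning both fruit and juice lands in G1. Anything that matches no keyword falls into "Other" and would need manual review.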

In [98]:
catg_filled = FoodDF.where(col("categories_en").isNotNull()).count()
print("There are a total of " + str(catg_filled) + " products with their categories.")

There are a total of 18049 products with their categories.


Let's take a look at the categories in order to sort them into our 5 product groups.

In [99]:
from pyspark.sql.functions import lower, col
Categories_df = FoodDF.select(col("categories_en").
                           alias("Categories")
                          ).groupBy("Categories"
                                   ).count().select("Categories",
                                                    f.col("count"
                                                         ).alias("Number of Products")) # Group by Main Category / count products
Categories_DF = Categories_df.where(col("Categories").isNotNull()) # show no nulls please

Categories_DF = Categories_DF.select(lower(col("Categories")).alias("category"),(col("Number of Products")))
Categories_DF = Categories_DF.distinct()

print("The Number of Products by Categories") # title
Categories_DF.sort(col("Number of Products").desc()).show(500)

The Number of Products by Categories
+--------------------+------------------+
|            category|Number of Products|
+--------------------+------------------+
|dairies,fermented...|               205|
|dairies,fermented...|               173|
|snacks,sweet snac...|               166|
|           beverages|               140|
|snacks,sweet snac...|               121|
|snacks,sweet snac...|               113|
|plant-based foods...|               102|
|dairies,fermented...|               100|
|plant-based foods...|               100|
|snacks,sweet snac...|                98|
|beverages,alcohol...|                87|
|snacks,sweet snac...|                85|
|plant-based foods...|                84|
|    groceries,sauces|                76|
|snacks,sweet snac...|                73|
|               meals|                71|
|groceries,sauces,...|                70|
|snacks,salty snac...|                69|
|beverages,alcohol...|                63|
|spreads,breakfast...|                6

In [100]:
categories_list_df = Categories_DF.select(
        "category","Number of Products",
        f.split("category", ",").alias("category_list"), # distinct name avoids a duplicate "category" column
        f.posexplode(f.split("category", ",")).alias("pos", "cat")
    )
categories_list_df.show(10)

+--------------------+------------------+--------------------+---+--------------------+
|            category|Number of Products|       category_list|pos|                 cat|
+--------------------+------------------+--------------------+---+--------------------+
|plant-based foods...|                 1|[plant-based food...|  0|plant-based foods...|
|plant-based foods...|                 1|[plant-based food...|  1|   plant-based foods|
|plant-based foods...|                 1|[plant-based food...|  2|           groceries|
|plant-based foods...|                 1|[plant-based food...|  3|              snacks|
|plant-based foods...|                 1|[plant-based food...|  4|          condiments|
|plant-based foods...|                 1|[plant-based food...|  5|fruits and vegeta...|
|plant-based foods...|                 1|[plant-based food...|  6|        salty snacks|
|plant-based foods...|                 1|[plant-based food...|  7|          appetizers|
|plant-based foods...|          

In [101]:
categories_l = categories_list_df.select("cat","Number of Products").where(
    col("pos")==1) # change value X in <col("pos")==X> to see the different levels of categorization 
grouped_cat = categories_l.select("cat","Number of Products").groupBy("cat").sum("Number of Products")
dstnct_cat = grouped_cat.select("cat","sum(Number of Products)").orderBy(col("sum(Number of Products)").desc())
dstnct_cat.show(50)

+--------------------+-----------------------+
|                 cat|sum(Number of Products)|
+--------------------+-----------------------+
|   plant-based foods|                   4815|
|        sweet snacks|                   2554|
|     fermented foods|                   1684|
|           beverages|                   1197|
|              sauces|                    657|
| alcoholic beverages|                    628|
|      prepared meats|                    413|
|        salty snacks|                    339|
|          breakfasts|                    269|
|        frozen foods|                    263|
|              fishes|                    249|
|            desserts|                    226|
|          condiments|                    214|
|   carbonated drinks|                    196|
|               milks|                    151|
|              waters|                    150|
|               meals|                    145|
|              syrups|                    124|
|            

These second-level categories (`pos == 1`) show which labels we can plug into our food grouping.

In [102]:
cats = dstnct_cat.count()
print("There are " + str(cats) + " different categories")

There are 233 different categories


#### Group 1: Fruits and Vegetables

In [103]:
G1_DF = FoodDF.where((col("categories_en").contains("vegetable")) | 
                     (col("categories_en").contains("fruit")) |
                     ((col("categories_en").contains("plant"))) |
                     (col("categories_en").contains("salads")))

G1 = G1_DF.count() # amount of products in Group 1
G1df = G1_DF.select("energy-kcal_100g","fat_100g","saturated-fat_100g","sugars_100g","salt_100g")
print("There are " + str(G1) + " products in Group 1")
G1df.describe().show()

fandv_perc = (G1 / catg_filled) * 100
print("Only " + str(fandv_perc) + " % of the categorized products are Fruits and Vegetables")

There are 2059 products in Group 1
+-------+------------------+------------------+------------------+------------------+------------------+
|summary|  energy-kcal_100g|          fat_100g|saturated-fat_100g|       sugars_100g|         salt_100g|
+-------+------------------+------------------+------------------+------------------+------------------+
|  count|              1242|              1765|              1727|              1748|              1724|
|   mean|172.19039988543565| 5.525515176914641| 2.149489784024708| 17.36731566745439|0.6845514635262131|
| stddev|171.72147949466137|13.730619007384417| 9.108011974952497|20.732195434631844| 3.287049754232219|
|    min|               0.0|               0.0|               0.0|               0.0|               0.0|
|    max|             927.0|             100.0|             100.0|              79.0|              56.9|
+-------+------------------+------------------+------------------+------------------+------------------+

Only 11.40783422904

#### Group 2: Carbohydrates

In [104]:
G2_DF = FoodDF.where((col("categories_en").contains("carb")) | 
                     (col("categories_en").contains("pasta")) |
                     (col("categories_en").contains("pizza")) |
                     (col("categories_en").contains("dessert")) |
                     (col("categories_en").contains("biscuit")) |
                     (col("categories_en").contains("cake")) |
                     (col("categories_en").contains("pastry")) |
                     (col("categories_en").contains("chocolate")) |
                     (col("categories_en").contains("pain")) |
                     (col("categories_en").contains("bread")) |
                     (col("categories_en").contains("marzipan")))
                             
G2 = G2_DF.count() # amount of products in Group 2
G2df = G2_DF.select("energy-kcal_100g","fat_100g","saturated-fat_100g","sugars_100g","salt_100g")
print("There are " + str(G2) + " products in Group 2")
G2df.describe().show()

carbs_perc = (G2 / catg_filled) * 100
print(str(carbs_perc) + " % of the categorized products are Carbohydrates")

There are 3192 products in Group 2
+-------+------------------+------------------+------------------+------------------+------------------+
|summary|  energy-kcal_100g|          fat_100g|saturated-fat_100g|       sugars_100g|         salt_100g|
+-------+------------------+------------------+------------------+------------------+------------------+
|  count|              2084|              3088|              3069|              3066|              3061|
|   mean| 384.4677004691832|17.846997733619038| 8.975873247736649| 24.10378125430293|0.5564624581508725|
| stddev|156.71530938905724|13.579058052448117| 8.405193911386306|18.862911331257777|1.3589862627376483|
|    min|               0.0|               0.0|               0.0|               0.0|               0.0|
|    max|            1598.0|             100.0|              50.0|              95.0|              44.8|
+-------+------------------+------------------+------------------+------------------+------------------+

17.68519031525292 %

#### Group 3: Dairy

In [105]:
G3_DF = FoodDF.where((col("categories_en").contains("dairies")) | 
                     (col("categories_en").contains("yogurt")) |
                     (col("categories_en").contains("milk")) |
                     (col("categories_en").contains("fermented")) |
                     (col("categories_en").contains("creams")))
                             
G3 = G3_DF.count() # amount of products in Group 3
G3df = G3_DF.select("energy-kcal_100g","fat_100g","saturated-fat_100g","sugars_100g","salt_100g")
print("There are " + str(G3) + " products in Group 3")
G3df.describe().show()

dairy_perc = (G3 / catg_filled) * 100
print(str(dairy_perc) + " % of the categorized products are Dairy")

There are 2441 products in Group 3
+-------+------------------+------------------+------------------+------------------+------------------+
|summary|  energy-kcal_100g|          fat_100g|saturated-fat_100g|       sugars_100g|         salt_100g|
+-------+------------------+------------------+------------------+------------------+------------------+
|  count|              1656|              2269|              2198|              2201|              2203|
|   mean|223.54453769457865|14.820733379266658| 8.532521809424056|  8.03930915580569|0.6746726469097699|
| stddev| 155.4661647333405|12.645336301457627|  8.22524036485264|10.362397118533531|0.8772953482976635|
|    min|               0.0|               0.0|               0.0|               0.0|               0.0|
|    max|            2000.0|              70.0|              52.0|              94.0|              11.3|
+-------+------------------+------------------+------------------+------------------+------------------+

13.524294974790847 

#### Group 4: Protein

In [106]:
G4_DF = FoodDF.where((col("categories_en").contains("meat")) | 
                     (col("categories_en").contains("seafood")) |
                     (col("categories_en").contains("fish")) |
                     (col("categories_en").contains("beef")) |
                     (col("categories_en").contains("egg")))

G4 = G4_DF.count() # amount of products in Group 4
G4df = G4_DF.select("energy-kcal_100g","fat_100g","saturated-fat_100g","sugars_100g","salt_100g","proteins_100g")
print("There are " + str(G4) + " products in Group 4")
G4df.describe().show()

protein_perc = (G4 / catg_filled) * 100
print("Only " + str(protein_perc) + " % of the categorized products are Protein")

There are 1095 products in Group 4
+-------+------------------+------------------+------------------+------------------+------------------+-----------------+
|summary|  energy-kcal_100g|          fat_100g|saturated-fat_100g|       sugars_100g|         salt_100g|    proteins_100g|
+-------+------------------+------------------+------------------+------------------+------------------+-----------------+
|  count|               726|              1002|               962|               960|               970|             1001|
|   mean|234.29889807162533|13.490059880382281|4.5667359671734005|1.8842781247213798|2.0792707840348355| 19.0742457560845|
| stddev|134.06837236182736|11.509252197510614| 5.353758437449673| 6.254162250898415| 1.629927559554961|9.719925282202805|
|    min|               0.0|               0.0|               0.0|               0.0|               0.0|              0.0|
|    max|            1150.0|              56.0|              37.0|              53.6|              18.0|

#### Group 5: Fluids

In [107]:
G5_DF = FoodDF.where((col("categories_en").contains("beverage")) | 
                     (col("categories_en").contains("alcohol")) |
                     (col("categories_en").contains("whiskey")) |
                     (col("categories_en").contains("coffee")) |
                     (col("categories_en").contains("wine")) |
                     (col("categories_en").contains("cider")) |
                     (col("categories_en").contains("water")) |    
                     (col("categories_en").contains("drink")))

G5 = G5_DF.count() # amount of products in Group 5
G5df = G5_DF.select("energy-kcal_100g","fat_100g","saturated-fat_100g","sugars_100g","salt_100g")
print("There are " + str(G5) + " products in Group 5")
G5df.describe().show()

fluids_perc = (G5 / catg_filled) * 100
print(str(fluids_perc) + " % of the categorized products are Fluids")

There are 7490 products in Group 5
+-------+------------------+------------------+------------------+------------------+------------------+
|summary|  energy-kcal_100g|          fat_100g|saturated-fat_100g|       sugars_100g|         salt_100g|
+-------+------------------+------------------+------------------+------------------+------------------+
|  count|              4321|              6040|              5952|              5991|              5982|
|   mean|244.45834250490879|10.107220106466333|2.4531671078883552| 10.20343915216165|1.0051219248959782|
| stddev| 215.9762977394819|20.282408772022176| 6.806984423423575|15.939810966534521|22.719854727786032|
|    min|               0.0|               0.0|               0.0|               0.0|               0.0|
|    max|            2590.0|             100.0|             100.0|             100.0|            1740.0|
+-------+------------------+------------------+------------------+------------------+------------------+

41.49814394149261 %

Share of the categorized products captured by each food group:

- Fruits & Vegetables: 11.4%
- Carbohydrates: 17.7%
- Dairy: 13.5%
- Protein: 6.1%
- Fluids: 41.5%

In [108]:
captured = (fluids_perc + protein_perc + dairy_perc + carbs_perc + fandv_perc)
print("A good " + str(captured) + " % was captured with the above filtering")

A good 90.18228156684582 % was captured with the above filtering
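One caveat on the captured share: the group filters are substring matches, so one product can land in several groups and the percentages are not mutually exclusive. The sketch below illustrates this in plain Python. The category strings are hypothetical examples written in the style of `categories_en`, and the keyword lists are a subset of the filters used above; `in` on Python strings is assumed to behave like `Column.contains`.

```python
# Hypothetical category strings in the style of the categories_en column.
products = [
    "plant-based foods and beverages,fruits and vegetables based foods",
    "dairies,fermented foods,fermented milk products,cheeses",
    "snacks,sweet snacks,chocolates",
    "beverages,waters,spring waters",
    "dairies,milks",
]

# Subset of the keyword filters used for Groups 1, 3 and 5 in this notebook.
groups = {
    "fruits_vegetables": ["vegetable", "fruit", "plant", "salads"],
    "dairy": ["dairies", "yogurt", "milk", "fermented", "creams"],
    "fluids": ["beverage", "alcohol", "water", "drink"],
}


def matching_groups(categories):
    """Return every group with at least one keyword occurring as a substring."""
    return [
        name
        for name, words in groups.items()
        if any(word in categories for word in words)
    ]


matches = [matching_groups(p) for p in products]
in_any = sum(1 for m in matches if m)             # products in at least one group
total_memberships = sum(len(m) for m in matches)  # what summing group counts measures

print(matches[0])                  # the first product matches two groups
print(in_any, total_memberships)   # memberships exceed distinct products
```

In this toy data the first string matches both the Fruits & Vegetables and the Fluids keywords (through "beverage"), so adding up group counts overstates how many distinct products were captured. A stricter check would count the products that match at least one filter.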


The food category distribution suggests that food habits in Switzerland, as reflected by the products on offer, are not well balanced. Among the solid food groups, carbohydrates have the highest share (17.7%), ahead of Dairy (13.5%) and Fruits & Vegetables (11.4%); it is intriguing that cheese and milk products are at least as common as fruits and vegetables. The protein group, which is as important as the others, if not more, is well below with only a 6.1% share. Finally, between sweet beverages, alcoholic beverages, and similar "unhealthy" products, fluids make up about 41.5% of the categorized food and beverage products in Switzerland. Because the filters are substring matches, a product can fall into more than one group, so these shares are not mutually exclusive.

For each of these food groups, we now look at the five metrics highlighted in the NHS guidelines: energy, fat, saturated fat, sugars, and salt.

Let's take a look at each metric individually and compare the results to the NHS guidelines.

### NHS Metric Guidelines

Check out the guidelines here: https://www.nhs.uk/live-well/eat-well/what-are-reference-intakes-on-food-labels/

#### Energy / Calories

When we eat and drink more calories than we use up, our bodies store the excess as body fat. If this continues, over time we may put on weight.

As a guide, an average man needs around 2,500kcal (10,500kJ) a day to maintain a healthy body weight.

For an average woman, that figure is around 2,000kcal (8,400kJ) a day.

- breakfast: 20% (a fifth of your energy intake) = 500 kcal (men) / 400 kcal (women)
- lunch: 30% (about a third of your energy intake) = 750 kcal (men) / 600 kcal (women)
- evening meal: 30% (about a third of your energy intake) = 750 kcal (men) / 600 kcal (women)
- drinks and snacks: 20% (a fifth of your energy intake) = 500 kcal (men) / 400 kcal (women)
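The per-meal figures above follow directly from the daily reference intake. A minimal sketch of that arithmetic, where `MEAL_SHARES_PCT` and `meal_budget` are hypothetical helper names introduced only for illustration:

```python
# Share of daily energy per meal, in percent, as listed in the guideline above.
MEAL_SHARES_PCT = {
    "breakfast": 20,
    "lunch": 30,
    "evening meal": 30,
    "drinks and snacks": 20,
}


def meal_budget(daily_kcal):
    """Split a daily energy reference (kcal) into per-meal budgets (kcal)."""
    return {meal: round(daily_kcal * pct / 100) for meal, pct in MEAL_SHARES_PCT.items()}


men = meal_budget(2500)    # average man
women = meal_budget(2000)  # average woman
print(men)
print(women)
```

The shares add up to 100%, so the per-meal budgets sum back to the daily reference.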

**Summary** 

Averages (per 100 g):
- Fruits & Vegetables: 172.2 kcal
- Carbohydrates: 384.5 kcal
- Dairy: 223.5 kcal
- Protein: 234.3 kcal
- Fluids: 244.5 kcal

Sum = 1259 kcal

Taking 100 g from each group, as a balanced diet suggests, would put both men and women well above the 600 to 750 kcal guideline for a single main meal.
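The comparison can be reproduced from the group means printed by `describe()` in this notebook. The sketch assumes a portion of exactly 100 g per group, which is a simplification rather than a real serving size; the variable names are illustrative only.

```python
# Mean energy per 100 g for each group, rounded from the describe() outputs.
mean_kcal_per_100g = {
    "Fruits & Vegetables": 172.19,
    "Carbohydrates": 384.47,
    "Dairy": 223.54,
    "Protein": 234.30,
    "Fluids": 244.46,
}

# Main-meal guideline (30% of the daily reference intake).
main_meal_budget = {"men": 750, "women": 600}

# Assumption: one 100 g portion from every group in the same meal.
total_kcal = sum(mean_kcal_per_100g.values())
excess = {who: round(total_kcal - budget) for who, budget in main_meal_budget.items()}

print(round(total_kcal, 2))
print(excess)
```

Under this assumption the plate exceeds the main-meal guideline for both men and women by several hundred kcal.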

In [109]:
print("Energy of Group 1: Fruits & Vegetables")
G1df.select("energy-kcal_100g").describe().show()

Energy of Group 1: Fruits & Vegetables
+-------+------------------+
|summary|  energy-kcal_100g|
+-------+------------------+
|  count|              1242|
|   mean|172.19039988543565|
| stddev|171.72147949466137|
|    min|               0.0|
|    max|             927.0|
+-------+------------------+



In [110]:
print("Energy of Group 2: Carbohydrates")
G2df.select("energy-kcal_100g").describe().show()

Energy of Group 2: Carbohydrates
+-------+------------------+
|summary|  energy-kcal_100g|
+-------+------------------+
|  count|              2084|
|   mean| 384.4677004691832|
| stddev|156.71530938905724|
|    min|               0.0|
|    max|            1598.0|
+-------+------------------+



In [111]:
print("Energy of Group 3: Dairy")
G3df.select("energy-kcal_100g").describe().show()

Energy of Group 3: Dairy
+-------+------------------+
|summary|  energy-kcal_100g|
+-------+------------------+
|  count|              1656|
|   mean|223.54453769457865|
| stddev| 155.4661647333405|
|    min|               0.0|
|    max|            2000.0|
+-------+------------------+



In [112]:
print("Energy of Group 4: Protein")
G4df.select("energy-kcal_100g").describe().show()

Energy of Group 4: Protein
+-------+------------------+
|summary|  energy-kcal_100g|
+-------+------------------+
|  count|               726|
|   mean|234.29889807162533|
| stddev|134.06837236182736|
|    min|               0.0|
|    max|            1150.0|
+-------+------------------+



In [113]:
print("Energy of Group 5: Fluids")
G5df.select("energy-kcal_100g").describe().show()

Energy of Group 5: Fluids
+-------+------------------+
|summary|  energy-kcal_100g|
+-------+------------------+
|  count|              4321|
|   mean|244.45834250490879|
| stddev| 215.9762977394819|
|    min|               0.0|
|    max|            2590.0|
+-------+------------------+



#### Fat

- high fat – more than 17.5g of fat per 100g
- low fat – 3g of fat or less per 100g, or 1.5g of fat per 100ml for liquids (1.8g of fat per 100ml for semi-skimmed milk)
- fat-free – 0.5g of fat or less per 100g or 100ml
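The thresholds above can be written as a small classifier and applied to the group means reported in this notebook. This is a sketch under stated assumptions: `fat_label` is a hypothetical helper, the label "medium" for values between the low and high bounds is not part of the list above, and the 17.5 g high threshold is applied to liquids as well because the list gives no separate figure.

```python
def fat_label(fat_g, liquid=False):
    """Classify fat per 100 g (or per 100 ml when liquid=True)."""
    low_limit = 1.5 if liquid else 3.0  # low-fat bound depends on solid vs liquid
    if fat_g <= 0.5:
        return "fat-free"
    if fat_g <= low_limit:
        return "low"
    if fat_g > 17.5:
        return "high"
    return "medium"  # assumed name for the range between low and high


# Group means rounded from the describe() outputs of G1df..G5df.
mean_fat = {
    "Fruits & Vegetables": 5.53,
    "Carbohydrates": 17.85,
    "Dairy": 14.82,
    "Protein": 13.49,
    "Fluids": 10.11,
}

labels = {group: fat_label(value) for group, value in mean_fat.items()}
print(labels)
```

A mean only describes the typical product; a per-product share of "high" items would say more about the spread within each group.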

**Summary** 

Averages:
- Fruits & Vegetables: 5.53g
- Carbohydrates: 17.85g
- Dairy: 14.82g
- Protein: 13.49g
- Fluids: 10.11g

All food groups average above the 3 gram low fat metric, although Fruits & Vegetables only moderately so.

Carbohydrates show an excess of fat, as their average is above the 17.5 gram high fat metric.

In [114]:
print("Fats of Group 1: Fruits & Vegetables")
G1df.select("fat_100g").describe().show()
print("It seems the mean fat is above the low fat recommendations (5.5g > 3g), according to NHS")

Fats of Group 1: Fruits & Vegetables
+-------+------------------+
|summary|          fat_100g|
+-------+------------------+
|  count|              1765|
|   mean| 5.525515176914641|
| stddev|13.730619007384417|
|    min|               0.0|
|    max|             100.0|
+-------+------------------+

It seems the mean fat is above the low fat recommendations (5.5g > 3g), according to NHS


In [115]:
print("Fats of Group 2: Carbohydrates")
G2df.select("fat_100g").describe().show()
print("It seems the mean fat is above the HIGH fat recommendations (17.9g > 17.5g), according to NHS")

Fats of Group 2: Carbohydrates
+-------+------------------+
|summary|          fat_100g|
+-------+------------------+
|  count|              3088|
|   mean|17.846997733619038|
| stddev|13.579058052448117|
|    min|               0.0|
|    max|             100.0|
+-------+------------------+

It seems the mean fat is above the HIGH fat recommendations (17.9g > 17.5g), according to NHS


In [116]:
print("   Fats of Group 3: Dairy")
G3df.select("fat_100g").describe().show()
print("It seems the mean fat is very much above the low fat recommendations but a bit below the high fat, according to NHS")

   Fats of Group 3: Dairy
+-------+------------------+
|summary|          fat_100g|
+-------+------------------+
|  count|              2269|
|   mean|14.820733379266658|
| stddev|12.645336301457627|
|    min|               0.0|
|    max|              70.0|
+-------+------------------+

It seems the mean fat is very much above the low fat recommendations but a bit below the high fat, according to NHS


In [117]:
print("  Fats of Group 4: Proteins")
G4df.select("fat_100g").describe().show()
print("It seems the mean fat is very much above the low fat recommendations, but below the high fat, according to NHS")

  Fats of Group 4: Proteins
+-------+------------------+
|summary|          fat_100g|
+-------+------------------+
|  count|              1002|
|   mean|13.490059880382281|
| stddev|11.509252197510614|
|    min|               0.0|
|    max|              56.0|
+-------+------------------+

It seems the mean fat is very much above the low fat recommendations, but below the high fat, according to NHS


In [118]:
print("   Fats of Group 5: Fluids")
G5df.select("fat_100g").describe().show()
print("It seems the mean fat is very much above the low fat recommendations, but well below the high fat, according to NHS")

   Fats of Group 5: Fluids
+-------+------------------+
|summary|          fat_100g|
+-------+------------------+
|  count|              6040|
|   mean|10.107220106466333|
| stddev|20.282408772022176|
|    min|               0.0|
|    max|             100.0|
+-------+------------------+

It seems the mean fat is very much above the low fat recommendations, but well below the high fat, according to NHS


#### Saturated Fat

- high in sat fat – more than 5 g of saturates per 100g
- low in sat fat – 1.5 g of saturates or less per 100g or 0.75g per 100ml for liquids
- sat fat-free – 0.1 g of saturates per 100g or 100ml

**Summary** 

Averages:
- Fruits & Vegetables: 2.15 g
- Carbohydrates: 8.98 g
- Dairy: 8.53 g
- Protein: 4.57 g
- Fluids: 2.45 g

Even the average fruit and vegetable product is above the 1.5 g low saturated fat threshold.

Carbohydrates and Dairy are well above the 5 g high saturated fat threshold.

In [119]:
print("Saturated Fats of Group 1: Fruits & Vegetables")
G1df.select("saturated-fat_100g").describe().show()
print("It seems the mean sat fat is above the low sat fat recommendations (2.15g > 1.5g), according to NHS")

Saturated Fats of Group 1: Fruits & Vegetables
+-------+------------------+
|summary|saturated-fat_100g|
+-------+------------------+
|  count|              1727|
|   mean| 2.149489784024708|
| stddev| 9.108011974952497|
|    min|               0.0|
|    max|             100.0|
+-------+------------------+

It seems the mean sat fat is above the low sat fat recommendations (2.15g > 1.5g), according to NHS


In [120]:
print("Saturated Fats of Group 2: Carbohydrates")
G2df.select("saturated-fat_100g").describe().show()
print("It seems the mean sat fat is very much above the HIGH sat fat recommendations (8.9g > 5g), according to NHS")

Saturated Fats of Group 2: Carbohydrates
+-------+------------------+
|summary|saturated-fat_100g|
+-------+------------------+
|  count|              3069|
|   mean| 8.975873247736649|
| stddev| 8.405193911386306|
|    min|               0.0|
|    max|              50.0|
+-------+------------------+

It seems the mean sat fat is very much above the HIGH sat fat recommendations (8.9g > 5g), according to NHS


In [121]:
print("Saturated Fats of Group 3: Dairy")
G3df.select("saturated-fat_100g").describe().show()
print("It seems the mean sat fat is very much above the HIGH sat fat recommendations (8.5g > 5g), according to NHS")

Saturated Fats of Group 3: Dairy
+-------+------------------+
|summary|saturated-fat_100g|
+-------+------------------+
|  count|              2198|
|   mean| 8.532521809424056|
| stddev|  8.22524036485264|
|    min|               0.0|
|    max|              52.0|
+-------+------------------+

It seems the mean sat fat is very much above the HIGH sat fat recommendations (8.5g > 5g), according to NHS


In [122]:
print("Saturated Fats of Group 4: Protein")
G4df.select("saturated-fat_100g").describe().show()
print("It seems the mean sat fat is very much above the low sat fat recommendations (4.6g > 1.5g), according to NHS")

Saturated Fats of Group 4: Protein
+-------+------------------+
|summary|saturated-fat_100g|
+-------+------------------+
|  count|               962|
|   mean|4.5667359671734005|
| stddev| 5.353758437449673|
|    min|               0.0|
|    max|              37.0|
+-------+------------------+

It seems the mean sat fat is very much above the low sat fat recommendations (4.6g > 1.5g), according to NHS


In [123]:
print("Saturated Fats of Group 5: Fluids")
G5df.select("saturated-fat_100g").describe().show()
print("It seems the mean sat fat is above the low sat fat recommendations (2.5g > 1.5g), according to NHS")

Saturated Fats of Group 5: Fluids
+-------+------------------+
|summary|saturated-fat_100g|
+-------+------------------+
|  count|              5952|
|   mean|2.4531671078883552|
| stddev| 6.806984423423575|
|    min|               0.0|
|    max|             100.0|
+-------+------------------+

It seems the mean sat fat is above the low sat fat recommendations (2.5g > 1.5g), according to NHS


#### Sugars

- high: more than 22.5g of total sugars per 100g
- low: 5g or less of total sugars per 100g

**Summary** 

Averages:
- Fruits & Vegetables: 17.4 g
- Carbohydrates: 24.1 g
- Dairy: 8.0 g
- Protein: 1.9 g
- Fluids: 10.2 g

Only carbohydrates average above the 22.5 g threshold for high total sugars.

All other food groups, except for Protein, average above the 5 g threshold for low total sugars.

In [124]:
print("Sugars of Group 1: Fruits & Vegetables")
G1df.select("sugars_100g").describe().show()

Sugars of Group 1: Fruits & Vegetables
+-------+------------------+
|summary|       sugars_100g|
+-------+------------------+
|  count|              1748|
|   mean| 17.36731566745439|
| stddev|20.732195434631844|
|    min|               0.0|
|    max|              79.0|
+-------+------------------+



In [125]:
print("Sugars of Group 2: Carbohydrates")
G2df.select("sugars_100g").describe().show()

Sugars of Group 2: Carbohydrates
+-------+------------------+
|summary|       sugars_100g|
+-------+------------------+
|  count|              3066|
|   mean| 24.10378125430293|
| stddev|18.862911331257777|
|    min|               0.0|
|    max|              95.0|
+-------+------------------+



In [126]:
print("Sugars of Group 3: Dairy")
G3df.select("sugars_100g").describe().show()

Sugars of Group 3: Dairy
+-------+------------------+
|summary|       sugars_100g|
+-------+------------------+
|  count|              2201|
|   mean|  8.03930915580569|
| stddev|10.362397118533531|
|    min|               0.0|
|    max|              94.0|
+-------+------------------+



In [127]:
print("Sugars of Group 4: Protein")
G4df.select("sugars_100g").describe().show()

Sugars of Group 4: Protein
+-------+------------------+
|summary|       sugars_100g|
+-------+------------------+
|  count|               960|
|   mean|1.8842781247213798|
| stddev| 6.254162250898415|
|    min|               0.0|
|    max|              53.6|
+-------+------------------+



In [128]:
print("Sugars of Group 5: Fluids")
G5df.select("sugars_100g").describe().show()

Sugars of Group 5: Fluids
+-------+------------------+
|summary|       sugars_100g|
+-------+------------------+
|  count|              5991|
|   mean| 10.20343915216165|
| stddev|15.939810966534521|
|    min|               0.0|
|    max|             100.0|
+-------+------------------+



#### Salt

To convert sodium to salt, you need to multiply the sodium amount by 2.5. For example, 1g of sodium per 100g is 2.5 grams of salt per 100g.

Adults should eat no more than 2.4g of sodium per day, as this is equal to 6g of salt.

- High in salt: more than 1.5g of salt per 100g
- Low in salt: 0.3g of salt or less per 100g 
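The conversion and the two thresholds above fit in a few lines. A minimal sketch, where `sodium_to_salt` and `salt_label` are hypothetical helper names and the label "medium" for the range between the two bounds is an assumption not stated in the list:

```python
def sodium_to_salt(sodium_g):
    """Convert grams of sodium to grams of salt (factor 2.5)."""
    return sodium_g * 2.5


def salt_label(salt_g_per_100g):
    """Classify salt content per 100 g using the high/low bounds above."""
    if salt_g_per_100g > 1.5:
        return "high"
    if salt_g_per_100g <= 0.3:
        return "low"
    return "medium"  # assumed name for the range between low and high


daily_salt_limit = sodium_to_salt(2.4)  # 2.4 g sodium per day

# Group means rounded from the describe() outputs of G1df..G5df.
mean_salt = {
    "Fruits & Vegetables": 0.68,
    "Carbohydrates": 0.56,
    "Dairy": 0.67,
    "Protein": 2.08,
    "Fluids": 1.01,
}
salt_labels = {group: salt_label(value) for group, value in mean_salt.items()}
print(daily_salt_limit, salt_labels)
```

Only the Protein mean falls in the "high" band; the other four group means sit between the two bounds.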

**Summary** 

Averages:
- Fruits & Vegetables: 0.68 g
- Carbohydrates: 0.56 g
- Dairy: 0.67 g
- Protein: 2.08 g
- Fluids: 1.01 g

Only the Protein group averages above the 1.5 g threshold for high salt. This is plausible, since many meat products, which are the main sources of protein here, are salted or cured.

However, all other food groups are above the 0.3 g low salt threshold, even Dairy!

In [129]:
print("Salt of Group 1: Fruits and Vegetables")
G1df.select("salt_100g").describe().show()

Salt of Group 1: Fruits and Vegetables
+-------+------------------+
|summary|         salt_100g|
+-------+------------------+
|  count|              1724|
|   mean|0.6845514635262131|
| stddev| 3.287049754232219|
|    min|               0.0|
|    max|              56.9|
+-------+------------------+



In [130]:
print("Salt of Group 2: Carbohydrates")
G2df.select("salt_100g").describe().show()

Salt of Group 2: Carbohydrates
+-------+------------------+
|summary|         salt_100g|
+-------+------------------+
|  count|              3061|
|   mean|0.5564624581508725|
| stddev|1.3589862627376483|
|    min|               0.0|
|    max|              44.8|
+-------+------------------+



In [131]:
print("Salt of Group 3: Dairy")
G3df.select("salt_100g").describe().show()

Salt of Group 3: Dairy
+-------+------------------+
|summary|         salt_100g|
+-------+------------------+
|  count|              2203|
|   mean|0.6746726469097699|
| stddev|0.8772953482976635|
|    min|               0.0|
|    max|              11.3|
+-------+------------------+



In [132]:
print("Salt of Group 4: Protein")
G4df.select("salt_100g").describe().show()

Salt of Group 4: Protein
+-------+------------------+
|summary|         salt_100g|
+-------+------------------+
|  count|               970|
|   mean|2.0792707840348355|
| stddev| 1.629927559554961|
|    min|               0.0|
|    max|              18.0|
+-------+------------------+



In [133]:
print("Salt of Group 5: Fluids")
G5df.select("salt_100g").describe().show()

Salt of Group 5: Fluids
+-------+------------------+
|summary|         salt_100g|
+-------+------------------+
|  count|              5982|
|   mean|1.0051219248959782|
| stddev|22.719854727786032|
|    min|               0.0|
|    max|            1740.0|
+-------+------------------+
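A maximum of 1740 g of salt per 100 g is physically impossible, which suggests a data-entry error (possibly a unit mix-up) that also inflates the standard deviation. The sketch below shows, on hypothetical values, how a single such entry distorts the mean and how a simple plausibility bound of 0 to 100 g per 100 g removes it. The bound and the sample values are assumptions for illustration, not figures taken from the dataset apart from the 1740.0 maximum.

```python
# Hypothetical salt_100g values; 1740.0 mirrors the maximum shown above.
salt_values = [0.0, 0.02, 0.5, 1.2, 1740.0]


def plausible(value_g_per_100g):
    """A nutrient amount per 100 g cannot be negative or exceed 100 g."""
    return 0.0 <= value_g_per_100g <= 100.0


kept = [v for v in salt_values if plausible(v)]

raw_mean = sum(salt_values) / len(salt_values)
clean_mean = sum(kept) / len(kept)

print(round(raw_mean, 3), round(clean_mean, 3))
```

On the real data the same bound could be applied as a filter on each `*_100g` column before calling `describe()`, so that the group means are not driven by a handful of impossible entries.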



### Conclusion

After studying food products in Switzerland, we found that the averages of almost all five identified food groups lie above the "low" thresholds suggested by the NHS, and that some groups exceed the "high" thresholds (fat, saturated fat, and sugars for Carbohydrates; saturated fat for Dairy; salt for Protein). We therefore conclude that some food products in this country are unhealthy, and some extremely so, according to the NHS guidelines. Finally, we recommend that regulations be put in place to ensure that the health of Swiss citizens is taken seriously.