# Cleaning the Swiss Food Composition Dataset

This is an auxiliary notebook that documents the steps taken to clean the dataset from the [Swiss Food Composition Database](https://naehrwertdaten.ch/en/) and make it ready for use in the tutorials. It uses functions in the `clean_dataset.py` module found in the same folder.

For an explanation of the abbreviations used in the dataset, [see here](https://valeursnutritives.ch/en/abbreviations/).

## Setup

In [None]:
# importing required packages
import os
import requests

import clean_dataset

# enable autoreload so that edits to custom modules are picked up without restarting the kernel
%load_ext autoreload
%autoreload 2

In [None]:
# URL of the dataset
url_data = "https://naehrwertdaten.ch/wp-content/uploads/2022/06/Swiss_food_composition_database.xlsx"

# path to store the raw data
path2data = "../data/swiss_food_composition_raw.xlsx"

## Fetch data

In [None]:
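if not os.path.isfile(path2data):
    # make sure the target folder exists before writing
    os.makedirs(os.path.dirname(path2data), exist_ok=True)

    # download the file and stop on HTTP errors instead of saving an error page
    response = requests.get(url_data, timeout=60)
    response.raise_for_status()

    with open(path2data, 'wb') as file:
        file.write(response.content)
else:
    print(f'File {path2data} already exists.')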

## Clean data

In [None]:
dataset = clean_dataset.process_dataset(path2data)
dataset.head()

## Check data

In [None]:
# check columns in data
dataset.columns

In [None]:
# check that no trace markers ('tr.') remain in any column
tr_count = dataset.apply(lambda x: x.value_counts().get('tr.', 0))
assert tr_count.sum() == 0

In [None]:
# check that no cell still contains a '<' marker
count = dataset.astype(str).apply(lambda col: col.str.count('<')).sum().sum()
assert count == 0
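
To illustrate what the two checks above guard against, here is a minimal sketch on a toy table. It does not reproduce `clean_dataset.process_dataset`, whose internals are not shown in this notebook. The helper `clean_markers`, the column names, and the choices to map `tr.` to 0 and to keep the number after `<` are all assumptions made for the example.

```python
import pandas as pd

# Toy frame containing both kinds of markers (illustrative values only).
toy = pd.DataFrame({
    "name": ["apple", "bread", "cheese"],
    "iodine": ["tr.", "<0.5", "12"],
})


def clean_markers(column: pd.Series) -> pd.Series:
    """Hypothetical helper: turn 'tr.' and '<x' entries into plain numbers."""
    as_text = column.astype(str).str.strip()
    # Assumption: a trace amount is treated as zero.
    as_text = as_text.mask(as_text == "tr.", "0")
    # Assumption: '<x' is replaced by x itself (the upper bound).
    as_text = as_text.str.lstrip("<")
    return pd.to_numeric(as_text)


toy["iodine"] = clean_markers(toy["iodine"])

# The same two checks as in the notebook, applied to the toy frame.
tr_left = toy.apply(lambda x: x.value_counts().get("tr.", 0)).sum()
lt_left = toy.astype(str).apply(lambda col: col.str.count("<")).sum().sum()
print(tr_left, lt_left)
```

After cleaning, both counters are zero and the `iodine` column is numeric, which is the state the assertions in this section expect of the real dataset.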

In [None]:
# check value range in numerical columns
dataset.describe()
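
The summary from `describe()` has to be read by eye. A programmatic range check is sketched below on a toy frame. The column names are made up, and the rule that nutrient amounts cannot be negative is an assumption of this example rather than something the notebook enforces.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset (hypothetical columns).
toy = pd.DataFrame({
    "name": ["apple", "bread"],
    "protein": [0.3, 8.8],
    "fat": [0.2, 3.3],
})

# Restrict the check to numerical columns, as describe() does by default.
numeric = toy.select_dtypes(include="number")

# Count negative entries per column; every count should be zero.
negative_counts = (numeric < 0).sum()
assert negative_counts.sum() == 0
```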

## Save data

If all checks pass, we save the dataset as a `.csv` file in the `data` folder. The index is named `ID` and written as the first column of the file; it is only a row identifier, not a data column.

In [None]:
# name the index so that it is written to the file as the "ID" column
dataset.index.name = "ID"

# save
path2store = path2data.replace("_raw.xlsx", ".csv")
dataset.to_csv(path2store)
print(f"Stored cleaned data in: {path2store}")
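
When the stored file is read back, the `ID` column can be restored as the index instead of being loaded as data. The sketch below shows the round trip on a toy frame written to a temporary folder. The frame contents and file name are illustrative assumptions, not part of the dataset.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a named index, mirroring how the cleaned dataset is saved.
toy = pd.DataFrame({"name": ["apple", "bread"], "energy_kcal": [52.0, 265.0]})
toy.index.name = "ID"

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(toy_path)  # the index is written as the first column, "ID"
    restored = pd.read_csv(toy_path, index_col="ID")  # read it back as the index

print(restored)
```

Passing `index_col="ID"` keeps the identifier out of the data columns, so the restored frame has the same columns as the one that was saved.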
