# Cleaning the Chocolate Dataset

This auxiliary notebook documents the steps taken to clean the [Chocolate Bar Ratings Dataset](https://www.kaggle.com/datasets/evangower/chocolate-bar-ratings) and prepare it for use in the tutorials. It uses functions from the `clean_chocolate_dataset.py` module located in the same folder.

For more information on the abbreviations used in the dataset, [see here](https://valeursnutritives.ch/en/abbreviations/).

## Setup

In [300]:
# importing required packages
import os
from urllib import request
import clean_chocolate_dataset

# setting jupyter notebook settings to reload custom packages
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [301]:
# URL of the dataset
url_data = "https://www.kaggle.com/datasets/evangower/chocolate-bar-ratings/download?datasetVersionNumber=1"

# path to store the raw data
path2data = "../data/chocolate_bars.csv"

## Fetch data

In [302]:
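if not os.path.isfile(path2data):
    # urllib.request has no get(); open the URL and read the raw bytes instead
    with request.urlopen(url_data) as response:
        content = response.read()

    with open(path2data, 'wb') as file:
        file.write(content)
else:
    print(f'File {path2data} already exists.')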

File ../data/chocolate_bars.csv already exists.


## Clean data

In [303]:
dataset = clean_chocolate_dataset.process_dataset(path2data)
dataset.head()

Before imputation:
Train set missing values: 
id                  0
bean_origin         0
year_reviewed       0
cocoa_percent       0
num_ingredients    75
rating              0
dtype: int64
Test set missing values: 
id                  0
bean_origin         0
year_reviewed       0
cocoa_percent       0
num_ingredients    12
rating              0
dtype: int64
After imputation:
Train set missing values: 
id                 0
bean_origin        0
year_reviewed      0
cocoa_percent      0
num_ingredients    0
rating             0
dtype: int64
Test set missing values: 
id                 0
bean_origin        0
year_reviewed      0
cocoa_percent      0
num_ingredients    0
rating             0
dtype: int64
Standardizing the data...
Standardization done!


| | id | bean_origin | year_reviewed | cocoa_percent | num_ingredients | rating | split |
|---|---|---|---|---|---|---|---|
| 0 | 2454 | Tanzania | 2019 | 0.771568 | -0.04148 | 3.25 | train |
| 1 | 2458 | Dominican Republic | 2019 | 0.771568 | -0.04148 | 3.5 | test |
| 2 | 2454 | Madagascar | 2019 | 0.771568 | -0.04148 | 3.75 | train |
| 3 | 2542 | Fiji | 2021 | -0.682486 | -0.04148 | 3.0 | train |
| 4 | 2546 | Venezuela | 2021 | 0.044541 | -0.04148 | 3.0 | train |
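
The log above suggests that `process_dataset` splits the rows into train and test sets, imputes the missing `num_ingredients` values, and standardizes the numerical features. The module itself is not shown here, so the following is only a minimal sketch of such a pipeline under stated assumptions: the helper name `impute_and_standardize`, the median imputation, and the use of train-set statistics for both steps are illustrative choices, not the module's confirmed behaviour.

```python
import numpy as np
import pandas as pd


def impute_and_standardize(df, columns, split_col="split"):
    """Hypothetical helper: impute and standardize `columns` using train-set statistics only."""
    out = df.copy()
    train = out[split_col] == "train"
    for col in columns:
        # Assumption: missing values are filled with the train-set median.
        out[col] = out[col].fillna(out.loc[train, col].median())
        # Standardize with the train-set mean and standard deviation,
        # so that no information from the test rows leaks into the scaling.
        mean = out.loc[train, col].mean()
        std = out.loc[train, col].std()
        out[col] = (out[col] - mean) / std
    return out


# Toy data in the same layout as the chocolate dataset (values are made up).
toy = pd.DataFrame({
    "cocoa_percent": [70.0, 72.0, 75.0, 68.0, 80.0],
    "num_ingredients": [3.0, np.nan, 4.0, 2.0, np.nan],
    "split": ["train", "train", "train", "train", "test"],
})
toy_clean = impute_and_standardize(toy, ["cocoa_percent", "num_ingredients"])
print(toy_clean)
```

Scaling with train-set statistics would also be consistent with the summary statistics further down, where the standardized columns over the full dataset have a mean close to, but not exactly, 0 and a standard deviation close to 1.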


## Check data

In [304]:
# check columns in data
dataset.columns

Index(['id', 'bean_origin', 'year_reviewed', 'cocoa_percent',
       'num_ingredients', 'rating', 'split'],
      dtype='object')

In [305]:
# check that no 'tr.' (trace) placeholders remain in any column
tr_count = dataset.apply(lambda x: x.value_counts().get('tr.', 0))
assert sum(tr_count)==0

In [306]:
# check that no '<' characters (as in below-threshold entries) remain in any cell
count = dataset.astype(str).applymap(lambda x: x.count('<')).sum().sum()
assert count==0
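
To see what the two checks above guard against, here is a small illustration on a made-up DataFrame (the column names and values are hypothetical). It contains one `'tr.'` entry and one entry with a `'<'` character, so both counts come out non-zero and the corresponding assertions would fail on such data. The sketch uses `Series.str.count` instead of `applymap`, which recent pandas versions deprecate.

```python
import pandas as pd

# Hypothetical raw table with placeholder strings instead of numbers.
raw = pd.DataFrame({
    "a": ["1.0", "tr.", "2.5"],
    "b": ["<0.1", "0.3", "0.7"],
})

# Count cells that are exactly the marker 'tr.' in each column.
tr_count = raw.apply(lambda col: col.value_counts().get("tr.", 0))

# Count '<' characters in every cell, then sum over rows and columns.
lt_count = raw.astype(str).apply(lambda col: col.str.count("<")).sum().sum()

print(tr_count.sum(), lt_count)
```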

In [307]:
# check value range in numerical columns
dataset.describe()

| | id | year_reviewed | cocoa_percent | num_ingredients | rating |
|---|---|---|---|---|---|
| count | 2530.0 | 2530.0 | 2530.0 | 2530.0 | 2530.0 |
| mean | 1429.800791 | 2014.374308 | -0.020942 | 0.004157 | 3.196344 |
| std | 757.648556 | 3.968267 | 1.020877 | 0.994361 | 0.445321 |
| min | 5.0 | 2006.0 | -5.408159 | -2.256401 | 1.0 |
| 25% | 802.0 | 2012.0 | -0.318972 | -1.148941 | 3.0 |
| 50% | 1454.0 | 2015.0 | -0.318972 | -0.04148 | 3.25 |
| 75% | 2079.0 | 2018.0 | 0.408054 | 1.065981 | 3.5 |
| max | 2712.0 | 2021.0 | 5.133728 | 3.280902 | 4.0 |
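
Inspecting the `describe()` output by eye works for a one-off run; the same range check can also be made explicit with assertions. The sketch below uses a hypothetical helper `check_ranges` and illustrative bounds read off the summary above (ratings between 1 and 4, review years between 2006 and 2021); the bounds the tutorials actually require may differ.

```python
import pandas as pd


def check_ranges(df, bounds):
    """Hypothetical helper: return the columns with values outside [low, high]."""
    violations = []
    for col, (low, high) in bounds.items():
        # Series.between is inclusive on both ends by default.
        if not df[col].between(low, high).all():
            violations.append(col)
    return violations


# Toy rows in the same layout as the cleaned dataset (values are illustrative).
sample = pd.DataFrame({
    "year_reviewed": [2019, 2021, 2006],
    "rating": [3.25, 3.0, 4.0],
})
bounds = {"year_reviewed": (2006, 2021), "rating": (1.0, 4.0)}
print(check_ranges(sample, bounds))
```

An empty list means every checked column stays within its bounds, so `assert not check_ranges(dataset, bounds)` could sit alongside the other checks.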


## Save data

If all checks passed, save the dataset as a `.csv` file in the `data` folder. The index is not written to the file, as it is only a row counter.

In [308]:
# save without the index column
path2store = path2data.replace(".csv", "_proc.csv")
dataset.to_csv(path2store, index=False)
print(f"Stored cleaned data in: {path2store}")


Stored cleaned data in: ../data/chocolate_bars_proc.csv
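
The reason for leaving out the index becomes visible when the file is read back: by default `to_csv` writes the index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. A small sketch with a toy DataFrame and a temporary directory (both illustrative) shows the difference:

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the cleaned dataset.
toy = pd.DataFrame({"id": [2454, 2458], "rating": [3.25, 3.5]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # default: the index is written
    toy.to_csv(without_index, index=False)  # the index is omitted

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'id', 'rating']
print(cols_without)  # ['id', 'rating']
```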
