# Cleaning the Chocolate Dataset

This auxiliary notebook documents the steps taken to clean the [Chocolate Bar Ratings Dataset](https://www.kaggle.com/datasets/evangower/chocolate-bar-ratings) and prepare it for use in the tutorials. It relies on functions from the `clean_chocolate_dataset.py` module in the same folder.

For more information on the abbreviations used in the dataset, [see here](https://valeursnutritives.ch/en/abbreviations/).

## Setup

In [1]:
# importing required packages
import os
from urllib import request
import clean_chocolate_dataset

# setting jupyter notebook settings to reload custom packages
%load_ext autoreload
%autoreload 2

In [2]:
# URL of the dataset
url_data = "https://www.kaggle.com/datasets/evangower/chocolate-bar-ratings/download?datasetVersionNumber=1"

# path to store the raw data
path2data = "../data/chocolate_bars.csv"

## Fetch data

In [3]:
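if not os.path.isfile(path2data):
    # download the raw file; urllib.request has no get(), so use urlopen()
    # (note: Kaggle may require authentication for this URL)
    with request.urlopen(url_data) as response:
        with open(path2data, 'wb') as file:
            file.write(response.read())
else:
    print(f'File {path2data} already exists.')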

File ../data/chocolate_bars.csv already exists.


## Clean data

In [4]:
# load the raw CSV and run the cleaning pipeline defined in clean_chocolate_dataset.py
dataset = clean_chocolate_dataset.process_dataset(path2data)
dataset.head()

Before imputation:
Train set missing values: 
bean_origin         0
year_reviewed       0
cocoa_percent       0
num_ingredients    75
rating              0
dtype: int64
Test set missing values: 
bean_origin         0
year_reviewed       0
cocoa_percent       0
num_ingredients    12
rating              0
dtype: int64
After imputation:
Train set missing values: 
bean_origin        0
year_reviewed      0
cocoa_percent      0
num_ingredients    0
rating             0
dtype: int64
Test set missing values: 
bean_origin        0
year_reviewed      0
cocoa_percent      0
num_ingredients    0
rating             0
dtype: int64
Standardizing the data...
Standardization done!
0       1
1       1
2       1
3       1
4       1
       ..
2525    0
2526    1
2527    1
2528    1
2529    1
Name: year_binary, Length: 2530, dtype: int64


| id | cocoa_percent | num_ingredients | rating | split | year_binary | country_peru | country_venezuela |
|---|---|---|---|---|---|---|---|
| 1 | 0.771568 | -0.04148 | 3.25 | train | 1 | 0 | 0 |
| 2 | 0.771568 | -0.04148 | 3.5 | test | 1 | 0 | 0 |
| 3 | 0.771568 | -0.04148 | 3.75 | train | 1 | 0 | 0 |
| 4 | -0.682486 | -0.04148 | 3.0 | train | 1 | 0 | 0 |
| 5 | 0.044541 | -0.04148 | 3.0 | train | 1 | 0 | 1 |
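
The log above reports imputation followed by standardization, with missing values counted separately for the train and test splits. A minimal sketch of how such a step could look is given here. It assumes median imputation and z-scoring with statistics computed on the training rows only; both choices are assumptions, and the actual logic lives in `clean_chocolate_dataset.py` and may differ. The function name `impute_and_standardize` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_and_standardize(df, numeric_cols, split_col="split"):
    """Fill missing values and z-score numeric columns using training rows only."""
    out = df.copy()
    train_mask = out[split_col] == "train"
    for col in numeric_cols:
        # Assumption: median imputation; the real module may use another strategy.
        fill_value = out.loc[train_mask, col].median()
        out[col] = out[col].fillna(fill_value)
        # Statistics come from the training rows so the test rows do not leak in.
        mean = out.loc[train_mask, col].mean()
        std = out.loc[train_mask, col].std()
        out[col] = (out[col] - mean) / std
    return out


# Hypothetical toy data with the same column names as the notebook output.
toy = pd.DataFrame(
    {
        "cocoa_percent": [70.0, 72.0, 75.0, 80.0],
        "num_ingredients": [3.0, np.nan, 5.0, np.nan],
        "split": ["train", "train", "train", "test"],
    }
)
toy_clean = impute_and_standardize(toy, ["cocoa_percent", "num_ingredients"])
print(toy_clean)
```

Fitting the statistics on the training rows alone would leave the full-dataset mean slightly off zero and the standard deviation slightly off one, as in the summary statistics of this notebook, although that is only one possible explanation.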


## Check data

In [5]:
# check columns in data
dataset.columns

Index(['cocoa_percent', 'num_ingredients', 'rating', 'split', 'year_binary',
       'country_peru', 'country_venezuela'],
      dtype='object')

In [6]:
# check value range in numerical columns
dataset.describe()

| | cocoa_percent | num_ingredients | rating | year_binary | country_peru | country_venezuela |
|---|---|---|---|---|---|---|
| count | 2530.0 | 2530.0 | 2530.0 | 2530.0 | 2530.0 | 2530.0 |
| mean | -0.020942 | 0.004157 | 3.196344 | 0.507115 | 0.096443 | 0.1 |
| std | 1.020877 | 0.994361 | 0.445321 | 0.500048 | 0.295256 | 0.300059 |
| min | -5.408159 | -2.256401 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.318972 | -1.148941 | 3.0 | 0.0 | 0.0 | 0.0 |
| 50% | -0.318972 | -0.04148 | 3.25 | 1.0 | 0.0 | 0.0 |
| 75% | 0.408054 | 1.065981 | 3.5 | 1.0 | 0.0 | 0.0 |
| max | 5.133728 | 3.280902 | 4.0 | 1.0 | 1.0 | 1.0 |
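
The visual checks above can also be expressed as explicit assertions. The sketch below is one possible way to do that, not part of the original module: the helper name `check_processed` is hypothetical, and the rating bounds of 1 to 4 are taken from the `min` and `max` rows of the summary. The example frame reuses the first five rows shown earlier.

```python
import pandas as pd


def check_processed(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values present")
    # Indicator columns should only contain 0 or 1.
    for col in ("year_binary", "country_peru", "country_venezuela"):
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} is not binary")
    # Assumption: ratings lie in [1, 4], as in the describe() output.
    if not df["rating"].between(1.0, 4.0).all():
        problems.append("rating outside [1, 4]")
    if not df["split"].isin(["train", "test"]).all():
        problems.append("unexpected split label")
    return problems


# First five rows of the processed dataset, copied from the output above.
example = pd.DataFrame(
    {
        "cocoa_percent": [0.771568, 0.771568, 0.771568, -0.682486, 0.044541],
        "num_ingredients": [-0.04148] * 5,
        "rating": [3.25, 3.5, 3.75, 3.0, 3.0],
        "split": ["train", "test", "train", "train", "train"],
        "year_binary": [1, 1, 1, 1, 1],
        "country_peru": [0, 0, 0, 0, 0],
        "country_venezuela": [0, 0, 0, 0, 1],
    }
)
print(check_processed(example))  # prints []
```

In the notebook, the same helper could be called as `check_processed(dataset)` before saving.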


## Save data

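If the checks above look correct, we save the dataset as a `.csv` file in the `data` folder. Note that `index=True` writes the `id` index as the first column of the file; it is only a row identifier, not a feature, so treat it as the index when loading the file.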

In [7]:
# save
path2store = path2data.replace(".csv", "_proc.csv")
dataset.to_csv(path2store, index=True)
print(f"Stored cleaned data in: {path2store}")


Stored cleaned data in: ../data/chocolate_bars_proc.csv
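
Because the cell above passes `index=True`, the `id` index is written as the first column of the CSV. A minimal sketch of reading such a file back with `id` restored as the index follows. It uses a small hypothetical frame and a temporary directory rather than the real `../data` path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame with an index named "id", mirroring the processed data.
frame = pd.DataFrame(
    {"cocoa_percent": [0.771568, -0.682486], "rating": [3.25, 3.0]},
    index=pd.Index([1, 4], name="id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chocolate_bars_proc.csv")
    frame.to_csv(path, index=True)
    # index_col="id" turns the first column back into the index
    # instead of leaving it as an extra feature column.
    reloaded = pd.read_csv(path, index_col="id")

# Raises an AssertionError if the round trip changed the data.
pd.testing.assert_frame_equal(reloaded, frame)
print(list(reloaded.columns))
```

Without `index_col`, `id` would be loaded as an ordinary column and could be mistaken for a feature.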
