**Purpose of this notebook**
- [X] serve as a tutorial on simple machine learning tools
- [X] offer a first contact with machine learning and/or Open Food Facts data

This tutorial estimates the `nutriscore` of a product with machine learning. Along the way, we share some key practices for building a machine learning model.

**What is in this notebook**
- [ ] look at the product data
- [ ] manipulate simple machine learning tools
- [ ] estimate nutriscore and evaluate one model

**WARNING**: you will need to install some packages. Preparing a dedicated virtual environment is recommended.

In [None]:
# # to install the basic packages
# # numpy: numerical Python package, the basis of array manipulation
# # pandas: DataFrame library (inspired by R data frames) to manipulate tabular data
# # scikit-learn: the scientific kit for "classic" and basic machine learning
# # matplotlib: plotting library
# # missingno: visualization of missing values
# # tqdm: to display progress bars
# !pip install -U scikit-learn matplotlib numpy pandas tqdm missingno

# Data preparation

This notebook was developed with specific relative paths to the data. If your data directory is elsewhere or the files have different names, change the commands in the notebook accordingly. Ask or look for help if you struggle.

Load the Open Food Facts data. It was extracted from the [available MongoDB dump](https://fr.openfoodfacts.org/data). Only some fields of the documents were retrieved, in particular the ingredients and the evaluation scores such as nutriscore, ecoscore and nova group.

The data is a list of dicts saved as a JSON text file. Each element of the list is a product, and its key-value pairs hold the information about that product.
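
To make this structure concrete, here is a minimal sketch of what one record might look like. The field names follow the ones used in the cells of this notebook; the record itself and all its values are invented for illustration.

```python
import json

# Hypothetical, heavily shortened record: the values are invented,
# only the field names mirror the ones used later in this notebook.
sample_json = """
[
  {
    "_id": "0000000000001",
    "product_name": "Example biscuit",
    "nutriscore_grade": "d",
    "nutriments": {"energy_100g": 2000, "sugars_100g": 30.0, "salt_serving": 0.1}
  }
]
"""

products = json.loads(sample_json)  # same structure as json.load(file)
first = products[0]                 # one product is one dict

print(type(products).__name__, type(first).__name__)
print(sorted(first.keys()))
print(first["nutriments"]["sugars_100g"])
```

The real file contains many more products and, most likely, more keys per product.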

## Load the original dataset

In [None]:
from pathlib import Path
import json

DATA_PATH = Path('../../data').resolve()  # Path.resolve() returns the real absolute path

In [None]:
# opening a file with `with` in `r` (read) mode avoids having to close it explicitly
with open(DATA_PATH / 'products.json', 'r') as file:
    data = json.load(file)

# you can also read the csv file with the following commands
# from pandas import DataFrame, read_csv
# df = read_csv(DATA_PATH / 'products.csv')

In [None]:
data[0].keys()

## Select/calculate relevant features of our products linked to nutriscore

In [None]:
from pandas import DataFrame, Series
from tqdm import tqdm
from typing import Dict, Any, List

In [None]:
def get_name(code: str, data: List[Dict[str, Any]]) -> str:
    """Return the name of the product whose `_id` equals `code`."""
    return [i.get('product_name') for i in data if i.get('_id') == code][0]

The first step for every data scientist and machine learning engineer is to filter the data correctly. In theory, with a lot of data, it is possible to feed every key-value pair to the model and let it select by itself the features that provide predictive power. In practice, be smart: save time, energy and money!

We want to estimate the nutriscore. The nutriscore is computed from the nutritional characteristics of the food, so let's use the nutriment data of the products.

It is also possible to calculate features from the raw data if they seem more relevant. For instance, it is very common to use Fast Fourier Transform (FFT) coefficients to process sound or vibration data.
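
As a side illustration of such calculated features, here is a minimal sketch with `numpy`. The signal is synthetic and has nothing to do with the food data; it only shows how FFT magnitudes can turn a raw signal into a feature vector.

```python
import numpy as np

# Synthetic signal (an assumption for illustration): 1 second sampled at 100 Hz,
# made of a 5 Hz sine wave plus a weaker 20 Hz sine wave.
sampling_rate = 100
t = np.arange(sampling_rate) / sampling_rate
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

# Magnitude of the real FFT: one value per frequency bin, usable as features.
fft_features = np.abs(np.fft.rfft(signal))
frequencies = np.fft.rfftfreq(signal.size, d=1 / sampling_rate)

# The strongest bin corresponds to the dominant frequency of the signal.
dominant_frequency = frequencies[np.argmax(fft_features)]
print(fft_features.shape)
print(dominant_frequency)
```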

**Warning**: sometimes it is obvious which features are related to the target (here the nutriscore), sometimes not.

In [None]:
# For each product, select the nutriment data per 100 g and per 100 ml.
# Luckily, each product has the information for one portion.
# If not, it should be calculated, as it helps to compare products.

# By convention, X holds the features and y the target/label.
# X_ and y_ are temporary dicts used to build the table/array data.
X_, y_ = {}, {} 
for product in tqdm(data):
    X_[product['_id']] = {k:v for k, v in product['nutriments'].items() if '_100g' in k or '_100ml' in k}
    y_[product['_id']] = product['nutriscore_grade']

In [None]:
X = DataFrame.from_dict(X_, orient='index')
y = Series(y_)

In [None]:
display(X.head(5))
display(X.tail(5))

There seem to be a lot of Not a Number (NaN) values. Let's take a look.
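
Before plotting, the share of missing values per column can also be computed directly. The sketch below uses a small invented table so that it runs on its own; the column names are only examples.

```python
import numpy as np
from pandas import DataFrame

# Toy table (invented values) standing in for the feature table X.
toy = DataFrame({
    "energy_100g": [100.0, 250.0, 80.0, 400.0],
    "sugars_100g": [5.0, np.nan, 1.0, 30.0],
    "alcohol_100g": [np.nan, np.nan, np.nan, 4.5],
})

# isna() gives a boolean table; the mean of booleans is the share of NaN.
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)
```

On the real table, the same expression would be `X.isna().mean()`.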

In [None]:
code = '5414972123165'
print(get_name(code, data))
X.loc[code].dropna()

In [None]:
import missingno as msno

msno.matrix(X);

In [None]:
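# Drop every column with more than 20% of NaN
perc = 20.0  # maximum share of NaN per column, in %
# thresh is the minimum number of non-NaN values a column needs to be kept
min_count = int(((100 - perc) / 100) * X.shape[0] + 1)
msno.matrix(X.dropna(axis=1, thresh=min_count));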

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X['nutrition-score-fr_100g'], y)

Something is odd with the nutrition score: it ranges from -10 to 30 for every category.

The information is there. For convenience, replace every NaN with a value, here 0.0.
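
To see what the column threshold and the fill value do, here is a minimal sketch on an invented table. It reuses the threshold formula from the cell above. Filling with 0 is a convenience and assumes that a missing value can be read as "none", which may not always hold.

```python
import numpy as np
from pandas import DataFrame

# Toy table (invented values): 10 products, 3 nutriment columns.
toy = DataFrame({
    "energy_100g": np.arange(10, dtype=float),   # no NaN
    "sugars_100g": [np.nan] + [1.0] * 9,         # 10% NaN
    "alcohol_100g": [np.nan] * 8 + [4.5, 0.5],   # 80% NaN
})

perc = 20.0
min_count = int(((100 - perc) / 100) * toy.shape[0] + 1)

# Keep only the columns with at least `min_count` non-NaN values.
kept = toy.dropna(axis=1, thresh=min_count)
# Replace the remaining NaN with 0.
filled = kept.fillna(0)

print(list(kept.columns))
print(int(filled.isna().sum().sum()))
```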

In [None]:
X = X.dropna(axis=1, thresh=min_count)
# drop() returns a new DataFrame, so the result must be assigned back to X
X = X.drop(columns='nutrition-score-fr_100g')
X = X.fillna(0)

There are some steps to complete before training a model:
- **split the data** into a train set and a test set. It is unthinkable to evaluate the model on data that were used for training. For an explanation, search for `data leakage` on the internet.
- **set a scaler** to help the model relate the variation of the features to the variation of the target more easily. This has a strong impact on some kinds of models, such as linear ones.
- **set a label encoder for the target** because machine learning tools handle numbers more easily than text.
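
The three steps can be sketched end to end on synthetic data. This sketch is not the approach used in the cells of this notebook: it chains the scaler and the model with `make_pipeline`, a scikit-learn helper that fits the scaler on the training rows only. The data and the variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(42)
# Synthetic data (assumption): 200 products, 3 features,
# with a grade driven only by the first feature.
X_toy = rng.normal(size=(200, 3))
grades = np.where(X_toy[:, 0] > 0, "d", "a")

# Encode the text labels as integers: 'a' -> 0, 'd' -> 1.
y_toy = LabelEncoder().fit_transform(grades)

# Split before fitting anything that learns from the features.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, train_size=0.7, random_state=42
)

# fit() fits the scaler on X_tr only, then trains the model on the scaled X_tr.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(pipeline.score(X_te, y_te))
```

Keeping the scaler inside the pipeline is one way to make it harder to fit it on the test rows by accident.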

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# build and fit the label encoder
# other encoders exist, such as the one-hot encoder. Luckily, here there is a ranking, a hierarchy between the target values:
# "a" is good whereas "e" is very bad.
# LabelEncoder sorts the labels alphabetically, so the integer codes keep this order.
encoder = LabelEncoder()
encoder.fit(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    stratify=y,
    random_state=42,
    train_size=0.7,
    )

scaler = StandardScaler()
scaler.fit(X_train);  # never fit the scaler on all the data, only on the train set, otherwise the sky will fall on you!

In [None]:
print(f'{X_train.shape=}')
print(f'{y_train.shape=}')

In [None]:
tmp = sorted(y.unique())
for i, j in zip(tmp, encoder.transform(tmp)):
    print(i,':', j)
del tmp

In [None]:
# warning: the output of sklearn preprocessors is a numpy array
print('X_train:', type(scaler.transform(X_train)))
print('y_train:', type(encoder.transform(y)))

In [None]:
display(X_train.iloc[0])
display(scaler.transform(X_train)[0])  # same product after scaling: the values change

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2)

X_train.iloc[0:10, 0:10].boxplot(rot=90, ax=axes[0])

axes[0].set_title('original')
axes[1].set_title('After scaler')

DataFrame(
    data=scaler.transform(X_train)[0:10, 0:10],
    columns=X_train.columns[0:10],
    index=X_train.index[0:10]
).boxplot(rot=90, ax=axes[1])

plt.show()

## Build and evaluate our model

**Good tips**:  
- If possible, start small and fast, then make your experiments more complex.
- Why not train on a small dataset first and check whether it reaches > 80%?
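
The evaluation cells in this section rely on `balanced_accuracy_score`. As a minimal sketch of why it can be preferable to plain accuracy when classes are unevenly represented, consider invented labels and a model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Invented labels: 8 products of class 0 and 2 products of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A lazy model that always predicts the majority class.
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)           # share of correct predictions
bal = balanced_accuracy_score(y_true, y_pred)  # mean of the per-class recalls

print(acc)  # 0.8
print(bal)  # 0.5
```

Plain accuracy rewards the lazy model, whereas balanced accuracy averages the recall of each class and shows that one class is never found.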

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=10_000)
model.fit(
    scaler.transform(X_train.iloc[0:500]),
    encoder.transform(y_train.iloc[0:500])
    )

In [None]:
# predict on the 500 rows used for training
y_pred_500 = model.predict(scaler.transform(X_train.iloc[0:500]))
# predict on 500 rows the model has not seen
y_pred_1000 = model.predict(scaler.transform(X_train.iloc[500:1000]))

In [None]:
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    multilabel_confusion_matrix
)

first_test = balanced_accuracy_score(
    y_true=encoder.transform(y_train.iloc[0:500]),
    y_pred=y_pred_500
    )

second_test = balanced_accuracy_score(
    y_true=encoder.transform(y_train.iloc[500:1000]),
    y_pred=y_pred_1000
    )

print(f'{first_test=}')
print(f'{second_test=}')

In [None]:
# The results are very bad: even on the data the model was trained on, the score is poor.
multilabel_confusion_matrix(
    y_true=encoder.transform(y_train.iloc[0:500]),
    y_pred=y_pred_500
)

In [None]:
# Is there a change if we train on all the training data?

model = LogisticRegression(max_iter=10_000)
model.fit(
    scaler.transform(X_train),
    encoder.transform(y_train)
    )

In [None]:
# predict on itself
y_pred_train = model.predict(scaler.transform(X_train))
# on unseen data
y_pred_test = model.predict(scaler.transform(X_test))

In [None]:
first_test = balanced_accuracy_score(
    y_true=encoder.transform(y_train),
    y_pred=y_pred_train
    )

second_test = balanced_accuracy_score(
    y_true=encoder.transform(y_test),
    y_pred=y_pred_test
    )

print(f'{first_test=}')
print(f'{second_test=}')

In [None]:
Series(y_pred_test).value_counts()

In [None]:
Series(encoder.transform(y_test)).value_counts()

In [None]:
def test_model(
    model,
    X_train=X_train, X_test=X_test,
    y_train=y_train, y_test=y_test,
    scaler=scaler, encoder=encoder
    ):
    """Fit `model` on the train set and print its balanced accuracy on train and test."""
    model.fit(
        scaler.transform(X_train),
        encoder.transform(y_train)
        )
    
    # predict on itself
    y_pred_train = model.predict(scaler.transform(X_train))
    # on unseen data
    y_pred_test = model.predict(scaler.transform(X_test))

    evaluation_on_train = balanced_accuracy_score(
        y_true=encoder.transform(y_train),
        y_pred=y_pred_train
        )
    evaluation_on_test = balanced_accuracy_score(
        y_true=encoder.transform(y_test),
        y_pred=y_pred_test
        )
    print(f'{model=}')
    print(f'{evaluation_on_train=}')
    print(f'{evaluation_on_test=}')

## Try the same approach with other models

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=500, max_depth=10)
test_model(model)

In [None]:
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(512, 512, 256, 64),
    learning_rate_init=0.001,
    learning_rate='adaptive',
    random_state=42,
    max_iter=100,
    )
test_model(model)