## Classify recipes
This example uses data gathered from a relational database. The data are not available for testing. The relevant relations are of the form:
```
recipe(id, ..., fat)
n100g(recipe, nfactor, nvalue)
```
`n100g` provides the nutrient values per 100 g of each recipe. The task is to predict the fat class of a recipe (red, orange, or green).

## Connect to the database and run the query

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [2]:
from sqlalchemy import create_engine

In [3]:
import json

In [4]:
confile = '/Users/flint/Data/psql/conf.json'
with open(confile, 'r') as infile:
    conf = json.load(infile)

In [5]:
user, database = 'postgres', 'recipes'
engine = create_engine("postgresql+psycopg2://{}:{}@localhost:5432/{}".format(
    user, conf['psw'], database))

In [6]:
sql = """
SELECT R.id, R.fat, N.nvalue, N.nfactor
FROM rcp.recipe AS R
JOIN rcp.n100g AS N ON R.id = N.recipe
"""

In [7]:
data = pd.read_sql(sql, engine)

In [8]:
data.head()

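|   | id | fat | nvalue | nfactor |
|---|------------|-----|---------|-----|
| 0 | ce5a91d93e | red | 402.746 | nrg |
| 1 | ce5a91d93e | red | 30.601 | fat |
| 2 | ce5a91d93e | red | 7.422 | sod |
| 3 | ce5a91d93e | red | 17.563 | sat |
| 4 | ce5a91d93e | red | 1.074 | sug |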


## Task 1: create feature and target datasets for classification

In [13]:
# One row per recipe, one column per nutrient factor
features = data.nfactor.unique()
X = np.zeros((len(data['id'].unique()), len(features)))

In [17]:
# Map each recipe id to its fat class (the group index holds (id, fat) pairs)
r2t = dict(data.groupby(['id', 'fat']).count().index.values)

In [14]:
X.shape

(10000, 6)

In [29]:
recepies = list(r2t.keys())
y = [None]*len(recepies)

# Precompute position lookups so each row is placed without a list search
r2index = {r: i for i, r in enumerate(recepies)}
f2index = {f: j for j, f in enumerate(features)}

for _, row in data.iterrows():
    i = r2index[row['id']]
    y[i] = row['fat']
    X[i, f2index[row['nfactor']]] = row['nvalue']
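
The row-by-row loop above reshapes the long query result (one row per recipe and nutrient) into a wide matrix (one row per recipe). As a hedged alternative, the same reshaping can be sketched with `pivot_table`. The small frame below is made-up data that only mimics the column names of the query result. Note that the pivot sorts the nutrient columns alphabetically, so the column order may differ from the one produced by the loop.

```python
import pandas as pd

# Toy long-format frame mimicking the query result (values are invented).
data = pd.DataFrame({
    "id": ["a", "a", "b", "b"],
    "fat": ["red", "red", "green", "green"],
    "nfactor": ["nrg", "fat", "nrg", "fat"],
    "nvalue": [400.0, 30.0, 120.0, 1.5],
})

# One row per recipe, one column per nutrient; missing nutrients become 0.
wide = data.pivot_table(index="id", columns="nfactor", values="nvalue",
                        aggfunc="first", fill_value=0.0)

# Fat class per recipe, aligned to the row order of `wide`.
target = data.drop_duplicates("id").set_index("id")["fat"].loc[wide.index]

X = wide.to_numpy()
y = target.tolist()
print(wide)
print(y)
```

Keeping `wide.columns` around makes it easy to tell later which column of `X` holds which nutrient.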

## Task 2: create a train and test setting

In [33]:
from sklearn.model_selection import train_test_split

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
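
By default `train_test_split` holds out 25% of the samples and shuffles them differently on every run. A minimal sketch of a reproducible, class-balanced split follows. It runs on random toy data, and the values chosen for `random_state` and `test_size` are assumptions made for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 12 samples, 3 features, 4 samples per class (invented).
rng = np.random.default_rng(0)
X = rng.random((12, 3))
y = ["green", "orange", "red"] * 4

# stratify=y keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
print(sorted(y_test))
```

Stratifying matters most when the classes are unbalanced. With roughly equal class sizes, an unstratified split usually gives similar proportions anyway.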

## Task 3: train a classifier

In [35]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier()

In [37]:
y_pred = rf.predict(X_test)
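
The query output lists `fat` among the nutrient factors, so `X` presumably contains the quantity from which the fat class is derived. If that is the case, a classifier may reach near-perfect scores for that reason alone. One way to check, sketched here on synthetic data, is to inspect `feature_importances_`. The cut-off values below are hypothetical and serve only to build a toy target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600
# Column 0 plays the role of fat per 100 g; columns 1 and 2 are noise.
X = rng.uniform(0.0, 30.0, size=(n, 3))
# Hypothetical cut-offs, chosen only for this illustration.
y = np.where(X[:, 0] <= 3.0, "green",
             np.where(X[:, 0] <= 17.5, "orange", "red"))

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; a dominant value points to the
# column that determines the class.
importances = rf.feature_importances_
print(importances.round(2))
```

If one column dominates in the real data as well, refitting without it would show how much the remaining nutrients contribute on their own.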

## Task 4: evaluate the classifier and create a confusion matrix

In [38]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, adjusted_rand_score

In [39]:
print(classification_report(y_test, y_pred, zero_division=0))

              precision    recall  f1-score   support

       green       1.00      1.00      1.00       833
      orange       1.00      1.00      1.00       883
         red       1.00      1.00      1.00       784

    accuracy                           1.00      2500
   macro avg       1.00      1.00      1.00      2500
weighted avg       1.00      1.00      1.00      2500
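
The classification report summarizes per-class scores, while a confusion matrix shows which classes are mistaken for which. A minimal sketch with `confusion_matrix` follows. The label lists are invented so that some errors are visible, and `y_hat` is a hypothetical name for the predictions.

```python
from sklearn.metrics import confusion_matrix

labels = ["green", "orange", "red"]
# Invented true and predicted labels, two samples per class.
y_true = ["green", "green", "orange", "orange", "red", "red"]
y_hat = ["green", "orange", "orange", "orange", "red", "green"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
```

With the notebook's own variables, the call would take `y_test` and `y_pred`. The resulting matrix can then be passed to `ConfusionMatrixDisplay(cm, display_labels=labels).plot()` to draw it, which requires matplotlib.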

