The following code is a simple program that predicts category classes for OpenFoodFacts data. I use the first 10K records from https://world.openfoodfacts.org/data and keep only the records whose category field (field/attribute index = 14) and ingredients field (field/attribute index = 34) are both non-empty. Classification uses LinearSVC from scikit-learn, wrapped in OneVsRestClassifier because a product can belong to several categories. The attribute used for classification is the ingredients text.

Import the packages

In [1]:
import io
import string
import re
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import average_precision_score

Category index lookup function

In [2]:
def get_cat_idx(cat, cat_text):
    # Return the position of cat_text in the list cat, or -1 if it is absent
    try:
        return cat.index(cat_text)
    except ValueError:
        return -1
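
The lookup above walks the category list on every call, which gets slow as the number of distinct categories grows. A dictionary keyed by category text gives the same indices with constant-time lookups. This is only a sketch: `get_or_add_cat_idx` and `cat_to_idx` are hypothetical names that do not appear in the notebook.

```python
def get_or_add_cat_idx(cat_to_idx, cat_text):
    """Return the index of cat_text, assigning the next free index if it is new."""
    if cat_text not in cat_to_idx:
        cat_to_idx[cat_text] = len(cat_to_idx)
    return cat_to_idx[cat_text]

# Indices are handed out in order of first appearance, as in the list version
cat_to_idx = {}
indices = [get_or_add_cat_idx(cat_to_idx, c) for c in ['pasta', 'Ketchup', 'pasta']]
print(indices)  # [0, 1, 0]
```

Because dictionaries preserve insertion order, `list(cat_to_idx)` would recover the same ordered category list that the notebook keeps in `categories`.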

Read the dataset file

In [3]:
categories = []
ingredients = []
food_cat = []

with open('SmallFile10K.txt', encoding='utf-8') as f:
    f.readline()  # skip the header row
    content = [x.strip('\n') for x in f.readlines()]
    for content_line in content:
        content_line_split = content_line.split('\t')
        # keep only records with both ingredients (34) and categories (14)
        if content_line_split[34] != '' and content_line_split[14] != '':
            cat_line = content_line_split[14].split(',')
            # remove the punctuation marks in the ingredients text
            ing = re.sub(r'[^\w\s]', '', content_line_split[34])
            ingredients.append(ing)
            curr_cat = []
            for cat_item in cat_line:
                # register unseen categories, then store the category index
                if cat_item not in categories:
                    categories.append(cat_item)
                curr_cat.append(get_cat_idx(categories, cat_item))
            food_cat.append(curr_cat)
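
To make the cleaning step concrete, here is a small sketch of what the regular expression does to one ingredients string. The sample strings are made up for illustration and are not taken from the dataset. Note that `\w` matches the underscore, which is why markers such as `_WHEAT_` survive in the cleaned text. The last lines show an optional variation, not part of the notebook: stripping whitespace around each category, which would matter if the category field separates entries with a comma followed by a space.

```python
import re

# Hypothetical ingredients string, for illustration only
raw = 'Sultanas (19%), Sugar, _WHEAT_ flour.'

# Drop every character that is neither a word character nor whitespace
clean = re.sub(r'[^\w\s]', '', raw)
print(clean)  # Sultanas 19 Sugar _WHEAT_ flour

# Optional variation (an assumption, not the notebook's behaviour):
# strip surrounding whitespace so ' Sweet snacks' and 'Sweet snacks' match
cats = [c.strip() for c in 'Snacks, Sweet snacks'.split(',')]
print(cats)  # ['Snacks', 'Sweet snacks']
```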

Set the training and test dataset

In [4]:
X = np.array(ingredients)
mlb = MultiLabelBinarizer() 
Y = mlb.fit_transform(food_cat) # transform the category class to binary representation

# Split into training and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2,
                                                    random_state=30)
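
As a rough illustration of the binary representation, the sketch below runs MultiLabelBinarizer on three made-up products. All `toy_*` names and values are hypothetical. Each row of the result is one product and each column is one category index. `inverse_transform` goes the other way, which is handy for turning predictions back into category names.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: three products, three categories
toy_categories = ['pasta', 'Ketchup', 'Proteinpulver']
toy_food_cat = [[0, 1], [2], [0, 2]]

toy_mlb = MultiLabelBinarizer()
toy_Y = toy_mlb.fit_transform(toy_food_cat)
print(toy_Y)
# [[1 1 0]
#  [0 0 1]
#  [1 0 1]]

# inverse_transform maps indicator rows back to tuples of category indices
toy_pred = np.array([[0, 0, 1], [0, 0, 0]])
toy_names = [[toy_categories[i] for i in row]
             for row in toy_mlb.inverse_transform(toy_pred)]
print(toy_names)  # [['Proteinpulver'], []]
```

The column order follows the sorted class values. In the notebook every index from 0 to `len(categories) - 1` occurs in `food_cat`, so column `k` corresponds to `categories[k]`.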

Classify

In [5]:
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),           # ingredients text -> token counts
    ('tfidf', TfidfTransformer()),               # token counts -> tf-idf weights
    ('clf', OneVsRestClassifier(LinearSVC()))])  # one linear SVM per category
classifier.fit(X_train, Y_train)
predicted = classifier.predict(X_test)
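
The notebook does not score the predictions numerically. One possible way to do so, shown here on small hand-written indicator matrices rather than on the real `Y_test` and `predicted`, is the micro-averaged F1 score, which pools true and false positives over all categories. The choice of metric and the `toy_*` arrays are assumptions for illustration. The sketch also counts rows that received no label at all, a case that shows up in the printed results.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions: 4 products, 3 categories
toy_true = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [1, 1, 0]])
toy_pred = np.array([[1, 0, 0],
                     [0, 0, 0],
                     [0, 0, 1],
                     [1, 0, 0]])

# 3 true positives, 0 false positives, 2 false negatives
micro_f1 = f1_score(toy_true, toy_pred, average='micro')
print(round(float(micro_f1), 2))  # 0.75

# Number of products for which no category was predicted
n_unlabelled = int((toy_pred.sum(axis=1) == 0).sum())
print(n_unlabelled)  # 1
```

With the notebook's own variables, the equivalent call would be `f1_score(Y_test, predicted, average='micro')`.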

See the results

In [6]:
for item, labels, true_labels in zip(X_test, predicted, Y_test):
    print('item: '+item)
    for label_elmt in range(0, len(labels)):
        if labels[label_elmt]==1:
            print('label: '+str(label_elmt)+','+categories[label_elmt])
    for true_label_elmt in range(0, len(true_labels)):
        if true_labels[true_label_elmt]==1:
            print('true_label: '+str(true_label_elmt)+','+categories[true_label_elmt])
    print('\n')

item: ENRICHED DURUM _WHEAT_ SEMOLINA SEMOLINA _WHEAT_ NIACIN IRON LACTATE THIAMINE MONONITRATE RIBOFLAVIN FOLIC ACID
true_label: 351,pasta


item: Tomato puree water tomato paste sugar white vinegar red jalapeno puree red jalapeno peppers salt citric acid salt red chili puree red chili peppers salt citric acid garlic puree garlic water citric acid onion powder garlic spices c
true_label: 393,Ketchup


item: Molkenproteinkonzentrat 99_Wheyproteinkonzentrat_ nur Schoko Kakaopulver entölt Aroma Pflanzenöl aus Raps Süßungsmittel AcesulfamK und Sucralose
label: 1043,Proteinpulver
true_label: 1043,Proteinpulver


item: Sultanas 19 Sugar Water Fortified Wheat Flour Wheat Flour Calcium Carbonate Iron Niacin Thiamin Raisins 14 Currants 4 Palm Oil Humectant Vegetable Glycerine Candied Citrus Peel Glucose Syrup Orange Peel Sugar Lemon Peel Acid Citric Acid Pasteurised Free Range Egg Apple Juice from Concentrate Molasses Orange and Lemon Peel Mixed Spice Coriander Cinnamon Clove Fennel Ginger Nut