# Recipe Classification using text data

To try this notebook, you need to prepare your own data. <br>
The following directory structure is assumed (`category0` and `category1` stand for the names of your categories). <br>

- /work/data/text/recipes
  - /category0_negative.pkl
  - /category0_positive.pkl
  - /category1_negative.pkl
  - /category1_positive.pkl
  - ...

Each pickle file is assumed to contain a JSON-like dictionary that maps a recipe ID to its {title, ingredients, steps}.

The task is recipe classification using {title, ingredients, steps} text data.

We treat it as a binary classification problem {0: the target recipe, 1: not the target recipe} and build classification models whose inputs are preprocessed text features.

- **1. Prepare dataset** <br>
- **2. Construct a Random Forest model** <br>
- **3. Construct an XGBoost model** <br>
- **4. How can we improve the model?** <br>

An example of recipe data. 

In [None]:
{
    1: {'title': "激ウマ！焼き餃子", 'ingredients': ["豚肉","にんにく","にら"], 'steps': ["たねを作る","皮に包む","美味しく焼く"]}
    , 2: {'title': "最高！焼き餃子", 'ingredients': ["豚肉","にんにく","にら","春菊"], 'steps': ["たねを作る","皮に包む","美味しく焼く"]}
}

Sample data for trying this notebook is provided at:
- */work/data/text/recipes/sample_positive.pkl* 
- */work/data/text/recipes/sample_negative.pkl* 
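
For readers preparing their own data, the following is a minimal sketch of how a pair of pickle files in the layout above could be written and read back. The category name `mycategory` and the temporary output directory are placeholders introduced for illustration; the dictionary layout (recipe ID mapped to title, ingredients, steps) follows the example above.

```python
import os
import pickle
import tempfile

# Hypothetical category name and output directory, used only for illustration.
label = "mycategory"
out_dir = tempfile.mkdtemp()

positive = {
    1: {"title": "激ウマ！焼き餃子",
        "ingredients": ["豚肉", "にんにく", "にら"],
        "steps": ["たねを作る", "皮に包む", "美味しく焼く"]},
}
negative = {
    2: {"title": "オシャレパスタ",
        "ingredients": ["パスタ", "卵", "ベーコン", "生クリーム"],
        "steps": ["パスタを茹でる", "ソースをいい感じに作っておく", "和える"]},
}

# One file per (category, polarity): <label>_positive.pkl and <label>_negative.pkl.
for polarity, recipes in (("positive", positive), ("negative", negative)):
    path = os.path.join(out_dir, "%s_%s.pkl" % (label, polarity))
    with open(path, "wb") as f:
        pickle.dump(recipes, f)

# Read one file back to confirm the round trip.
with open(os.path.join(out_dir, label + "_positive.pkl"), "rb") as f:
    loaded = pickle.load(f)
print(loaded[1]["title"])
```

In a real setup you would write the files under the recipes directory instead of a temporary one.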

## 1. Prepare dataset

In Japanese text analysis, we need morphological analysis to break text down into a sequence of words.

Here we use MeCab ( http://taku910.github.io/mecab/ ) for morphological analysis.

In [1]:
import re
import os
import pickle

import MeCab
import numpy as np

from scipy.sparse import vstack

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

from concurrent.futures import ThreadPoolExecutor

### Define the signs to remove and load the Cookpad dictionary.

We assume you have already downloaded the data from the S3 bucket.

Various signs can add noise to the model, so we remove them.

Since we have a useful Cookpad dictionary, we use it.

[Reference] First, check the behavior with the default dictionary.

老干媽 is a kind of chilli oil. Recipes: https://cookpad.com/search/%E8%80%81%E5%B9%B2%E5%AA%BD?order=date&page=1

In [2]:
MECAB_TAGGER= MeCab.Tagger("-Ochasen")
print( MECAB_TAGGER.parse("老干媽") )
print( MECAB_TAGGER.parse("老干媽") )

老	ロウ	老	接頭詞-名詞接続		
干	ヒ	干る	動詞-自立	一段	連用形
媽	媽	媽	名詞-一般		
EOS

老	ロウ	老	接頭詞-名詞接続		
干	ヒ	干る	動詞-自立	一段	連用形
媽	媽	媽	名詞-一般		
EOS



In [3]:
print( MECAB_TAGGER.parse(
    "ザー菜は粗いみじん切りにする。ボールに豚肉、たね用の材料、ザー菜を入れ、全体に粘りが出るまで手で練り混ぜ、たねを作る。"
) )

ザー	ザー	ザー	名詞-一般		
菜	サイ	菜	名詞-一般		
は	ハ	は	助詞-係助詞		
粗い	アライ	粗い	形容詞-自立	形容詞・アウオ段	基本形
みじん切り	ミジンギリ	みじん切り	名詞-一般		
に	ニ	に	助詞-格助詞-一般		
する	スル	する	動詞-自立	サ変・スル	基本形
。	。	。	記号-句点		
ボール	ボール	ボール	名詞-一般		
に	ニ	に	助詞-格助詞-一般		
豚肉	ブタニク	豚肉	名詞-一般		
、	、	、	記号-読点		
たね	タネ	たね	名詞-固有名詞-人名-名		
用	ヨウ	用	名詞-接尾-一般		
の	ノ	の	助詞-連体化		
材料	ザイリョウ	材料	名詞-一般		
、	、	、	記号-読点		
ザー	ザー	ザー	名詞-一般		
菜	サイ	菜	名詞-一般		
を	ヲ	を	助詞-格助詞-一般		
入れ	イレ	入れる	動詞-自立	一段	連用形
、	、	、	記号-読点		
全体	ゼンタイ	全体	名詞-副詞可能		
に	ニ	に	助詞-格助詞-一般		
粘り	ネバリ	粘る	動詞-自立	五段・ラ行	連用形
が	ガ	が	助詞-格助詞-一般		
出る	デル	出る	動詞-自立	一段	基本形
まで	マデ	まで	助詞-副助詞		
手	テ	手	名詞-一般		
で	デ	で	助詞-格助詞-一般		
練り	ネリ	練る	動詞-自立	五段・ラ行	連用形
混ぜ	マゼ	混ぜる	動詞-自立	一段	連用形
、	、	、	記号-読点		
たね	タネ	たね	名詞-固有名詞-人名-名		
を	ヲ	を	助詞-格助詞-一般		
作る	ツクル	作る	動詞-自立	五段・ラ行	基本形
。	。	。	記号-句点		
EOS
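
Each line of the `-Ochasen` output above is tab-separated: surface form, reading, base form, part of speech, and (for inflected words) conjugation type and conjugation form. As a rough sketch of how such output can be consumed, the snippet below parses a few lines copied from the result above without calling MeCab; the helper name `parse_chasen` is made up for this example.

```python
# A few lines copied from the MeCab output above (tab-separated ChaSen format).
chasen_output = (
    "豚肉\tブタニク\t豚肉\t名詞-一般\t\t\n"
    "を\tヲ\tを\t助詞-格助詞-一般\t\t\n"
    "入れ\tイレ\t入れる\t動詞-自立\t一段\t連用形\n"
    "EOS\n"
)

def parse_chasen(output):
    """Return (surface, base form, part of speech) for every token line."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":   # skip blanks and the end-of-sentence marker
            continue
        fields = line.split("\t")
        tokens.append((fields[0], fields[2], fields[3]))
    return tokens

tokens = parse_chasen(chasen_output)
# Keep only tokens whose part of speech starts with 名詞 (noun).
nouns = [surface for surface, _, pos in tokens if pos.startswith("名詞")]
print(tokens)
print(nouns)
```

The noun filter here is the same idea that the preprocessing functions later in this notebook apply to real MeCab output.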



Define the signs to remove; they are unlikely to contribute to model performance.

**If you have your own dictionary, you can set it here (the dictionary file is assumed to be placed at /work/data/text/YOUR_DICT).** <br>
For example, at Cookpad we have our own dictionary specialized for recipes.

In [4]:
SIGNS = "，．・：；？！゛゜´｀¨＾￣＿ヽヾ〃仝〆〇‐／∥｜…‥‘’“”〔〕［］｛｝〈〉《》「」『』【】\
        ＋－±×÷＝≠＜＞≦≧∞∴♂♀°′″℃￥＄￠￡％＃＆＊＠§☆★○●◎◇字◆□■△▲▽▼※〒→←↑↓〓\
       ∈∋⊆⊇⊂⊃∪∩∧∨￢⇒⇔∀∃∠⊥⌒∂∇≡≒≪≫√∽∝∵∫∬Å‰♯♭♪†‡¶◯';♥♡♫❤✿。◆◇♢♦❖∮彡☺～α✾✣⁂/*"

# MECAB_USER_DIC_PATH = os.path.join('..', 'data', 'text', YOUR OWN DICT)
# MECAB_TAGGER= MeCab.Tagger("-Ochasen -u %(MECAB_USER_DIC_PATH)s" % globals())

In [5]:
print( MECAB_TAGGER.parse("老干媽") )
print( MECAB_TAGGER.parse("老干媽") )

老干媽			名詞-一般		
EOS

老干媽			名詞-一般		
EOS



Check morphological analysis using a test sentence.

In [6]:
print( MECAB_TAGGER.parse(
    "ザー菜は粗いみじん切りにする。ボールに豚肉、たね用の材料、ザー菜を入れ、全体に粘りが出るまで手で練り混ぜ、たねを作る。"
) )

ザー菜			名詞-一般		
は	ハ	は	助詞-係助詞		
粗い	アライ	粗い	形容詞-自立	形容詞・アウオ段	基本形
みじん切り			名詞-一般		
に	ニ	に	助詞-格助詞-一般		
する	スル	する	動詞-自立	サ変・スル	基本形
。	。	。	記号-句点		
ボール			名詞-一般		
に	ニ	に	助詞-格助詞-一般		
豚肉	ブタニク	豚肉	名詞-一般		
、	、	、	記号-読点		
たね			名詞-一般		
用	ヨウ	用	名詞-接尾-一般		
の	ノ	の	助詞-連体化		
材料	ザイリョウ	材料	名詞-一般		
、	、	、	記号-読点		
ザー菜			名詞-一般		
を	ヲ	を	助詞-格助詞-一般		
入れ	イレ	入れる	動詞-自立	一段	連用形
、	、	、	記号-読点		
全体	ゼンタイ	全体	名詞-副詞可能		
に	ニ	に	助詞-格助詞-一般		
粘り	ネバリ	粘る	動詞-自立	五段・ラ行	連用形
が	ガ	が	助詞-格助詞-一般		
出る	デル	出る	動詞-自立	一段	基本形
まで	マデ	まで	助詞-副助詞		
手	テ	手	名詞-一般		
で	デ	で	助詞-格助詞-一般		
練り	ネリ	練る	動詞-自立	五段・ラ行	連用形
混ぜ	マゼ	混ぜる	動詞-自立	一段	連用形
、	、	、	記号-読点		
たね			名詞-一般		
を	ヲ	を	助詞-格助詞-一般		
作る			名詞-一般		
。	。	。	記号-句点		
EOS



Try your own sentence!

For example: 今日はとても良い天気なので、公園で論文を読もう。

In [7]:
print( MECAB_TAGGER.parse("今日はとても良い天気なので、公園で論文を読もう。") )

今日			名詞-一般		
は	ハ	は	助詞-係助詞		
とても			名詞-一般		
良い	ヨイ	良い	形容詞-自立	形容詞・アウオ段	基本形
天気	テンキ	天気	名詞-一般		
な	ナ	だ	助動詞	特殊・ダ	体言接続
ので	ノデ	ので	助詞-接続助詞		
、	、	、	記号-読点		
公園			名詞-一般		
で	デ	で	助詞-格助詞-一般		
論文	ロンブン	論文	名詞-一般		
を	ヲ	を	助詞-格助詞-一般		
読も	ヨモ	読む	動詞-自立	五段・マ行	未然ウ接続
う	ウ	う	助動詞	不変化型	基本形
。	。	。	記号-句点		
EOS



### Define some functions for modeling.

Load the p(ositive) and n(egative) recipes for a given label.

In [9]:
def recipes_for_label(label):
    with open('../data/text/recipes/%(label)s_positive.pkl' % locals(), 'rb') as f:
        p = pickle.load(f)
    
    with open('../data/text/recipes/%(label)s_negative.pkl' % locals(), 'rb') as f:
        n = pickle.load(f)
        
    return p, n

Check the function. Here we use the sample data (a small, made-up dataset).

In [10]:
pos, neg = recipes_for_label("sample")

In [11]:
pos_idx = 1
pos[pos_idx]

{'ingredients': ['豚肉', 'にんにく', 'にら'],
 'steps': ['たねを作る', '皮に包む', '美味しく焼く'],
 'title': '激ウマ！焼き餃子'}

In [12]:
neg_idx = 1
neg[neg_idx]

{'ingredients': ['パスタ', '卵', 'ベーコン', '生クリーム'],
 'steps': ['パスタを茹でる', 'ソースをいい感じに作っておく', '和える'],
 'title': 'オシャレパスタ'}

Preprocessing functions: remove the signs and extract nouns.

In this analysis we use ONLY nouns as input features.

In [13]:
def _clean(in_str):
    return re.sub(r'\d+', ' ', re.sub(r"[%(SIGNS)s]+" % globals(), ' ', in_str))

def _extract_nouns(in_str):
    words = [ws.split('\t') for ws in MECAB_TAGGER.parse(in_str).split('\n')[:-2]]
    nouns = [w[0] for w in words if w[3].startswith('名')] # use only noun for input features
    return ' '.join(nouns)

def clean_and_extract_nouns(in_str):
    return _extract_nouns(_clean(in_str))

Check the function.

In [14]:
text = "キラッ☆流星にまたがって"
print( _clean(text) )
print( _extract_nouns(_clean(text)) )

キラッ 流星にまたがって
キラッ 流星
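
Because `SIGNS` is interpolated directly into a regular-expression character class, any character that is special inside `[...]` (such as `]`, `\`, `^` or `-`) would change the meaning of the pattern. The string above appears to avoid these, but escaping is safer if you extend it. A minimal sketch with a shortened, hypothetical sign list (`MY_SIGNS`):

```python
import re

# Hypothetical, shortened sign list that includes regex-special characters.
MY_SIGNS = "☆★！？[]-^\\"

# re.escape makes every character literal, so the class matches exactly these signs.
SIGN_PATTERN = re.compile("[" + re.escape(MY_SIGNS) + "]+")

def clean(text):
    """Replace runs of signs, then runs of digits, with a single space each."""
    return re.sub(r"\d+", " ", SIGN_PATTERN.sub(" ", text))

cleaned = clean("キラッ☆流星[100]-に")
print(cleaned.split())
```

Compiling the pattern once also avoids rebuilding it for every recipe.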


Preprocessing: concatenate the preprocessed {title, ingredients, steps} texts and return a dictionary {recipe ID: text}.

In [15]:
def tokenize_recipes(recipes):
    def flatten_recipe(recipe):
        return ' '.join([recipe['title']] + recipe['ingredients'] + recipe['steps'])

    with ThreadPoolExecutor() as executor:
        processed_recipes = executor.map(clean_and_extract_nouns, map(flatten_recipe, recipes.values()))
        result = {rid: recipe for rid, recipe in zip(recipes.keys(), processed_recipes)}

    return result

Check the function.

In [16]:
tokenize_recipes(pos)

{1: '激 ウマ 焼き 餃子 豚肉 にんにく にら たね 作る 皮 包む 美味しく 焼く',
 2: '最高 焼き 餃子 豚肉 にんにく にら 春菊 たね 作る 皮 包む 美味しく 焼く'}

In [17]:
tokenize_recipes(neg)

{1: 'オシャレ パスタ パスタ 卵 ベーコン 生クリーム パスタ 茹でる ソース いい 感じ 和える',
 2: 'センス パスタ パスタ パンチェッタ オリーブオイル パスタ 茹でる ソース いい 感じ 和える'}
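
`tokenize_recipes` pairs `recipes.keys()` with the results of `executor.map`. This is only correct because `Executor.map` yields results in the order of its inputs, regardless of which task finishes first. A small sketch illustrating this, using a made-up `slow_upper` function in place of the MeCab call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_upper(text):
    """Stand-in for an expensive tokenizer: shorter inputs take longer to finish."""
    time.sleep(0.05 / len(text))
    return text.upper()

docs = {10: "a", 20: "bbbb", 30: "cc"}

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() returns results in input order even though the tasks finish out of order.
    processed = executor.map(slow_upper, docs.values())
    result = dict(zip(docs.keys(), processed))

print(result)
```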

Preprocessing: fit tf-idf on the input and vectorize it.

tf-idf ( https://en.wikipedia.org/wiki/Tf%E2%80%93idf ) is one of the most widely used text features for machine learning.
- tf is the term frequency: the number of occurrences of each word, counted per document (here, per recipe)
- idf is the inverse document frequency: it weights each word according to how many documents contain it, so that words appearing in few documents get larger weights <br>
  $$ \text{(weight for the target word)} \sim \log \frac{ \text{(number of all documents)} }{ \text{(number of documents including the word)} } $$

In [18]:
def _get_idx_features(bowed_recipes):
    recipe_ids, features = zip(*bowed_recipes.items())
    return np.array(list(recipe_ids)), np.array(list(features))

def tfidf_vectorizer(bowed_recipes):
    v = TfidfVectorizer()
    _, r = _get_idx_features(bowed_recipes)
    return v.fit(r)

def vectorize_recipes(bowed_recipes, vectorizer):
    recipe_ids, r = _get_idx_features(bowed_recipes)
    return np.array(list(recipe_ids)), vectorizer.transform(r)
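
One detail to keep in mind: with default settings, `TfidfVectorizer` tokenizes with the pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters. Single-character nouns such as 卵 or 皮 are therefore dropped before tf-idf is computed. The sketch below applies that pattern with the standard `re` module to show the effect. The alternative pattern `(?u)\b\w+\b` is just one possible choice, which could be passed as `TfidfVectorizer(token_pattern=...)`.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # scikit-learn's default token_pattern
KEEP_SINGLE = r"(?u)\b\w+\b"         # assumed alternative that keeps 1-character tokens

text = "パスタ 卵 皮 豚肉"

default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(KEEP_SINGLE, text)

print(default_tokens)
print(all_tokens)
```

Whether keeping single-character tokens helps is an empirical question; it is one of the things worth trying when improving the model.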

In [19]:
text_dict = {
    1: "うらみます うらみます うらみます"
    , 2: "うらみます あんた こと 死ぬまで"    
}

text_tfidf = tfidf_vectorizer(text_dict)
recipe_ids,text_vector = vectorize_recipes(text_dict, text_tfidf)

print(recipe_ids)
print(text_vector.toarray())

[1 2]
[[ 0.          1.          0.          0.        ]
 [ 0.53404633  0.37997836  0.53404633  0.53404633]]
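
The numbers printed above can be reproduced by hand. With its default settings, scikit-learn uses a smoothed idf, $\ln\frac{1+n}{1+\mathrm{df}} + 1$, multiplies it by the raw term count, and then scales each row to unit Euclidean length. The following sketch recomputes the two rows with only the standard library; the vocabulary order is taken from the output above.

```python
import math

docs = [
    "うらみます うらみます うらみます".split(),
    "うらみます あんた こと 死ぬまで".split(),
]
vocab = ["あんた", "うらみます", "こと", "死ぬまで"]  # column order of the matrix above

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = [sum(term in doc for doc in docs) for term in vocab]
# Smoothed idf, as used by TfidfVectorizer's defaults (smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

rows = []
for doc in docs:
    weights = [doc.count(term) * w for term, w in zip(vocab, idf)]
    norm = math.sqrt(sum(x * x for x in weights))   # L2 normalization
    rows.append([x / norm for x in weights])

for row in rows:
    print([round(x, 8) for x in row])
```

Note how うらみます, which occurs in both documents, receives the smallest weight in the second row.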


Preprocessing: training data generator.

For simplicity, we treat the data as NumPy arrays: we stack the vectorized text features into X and the labels into y.

SciPy sparse matrices let us handle sparse data (most entries are zero) efficiently; note that the function below converts the stacked matrix to a dense array at the end.

In [20]:
def generate_train_data(label):
    p_recipes, n_recipes = recipes_for_label(label)

    p_recipes_tokenized = tokenize_recipes(p_recipes)
    n_recipes_tokenized = tokenize_recipes(n_recipes)

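    # Fit on all texts. Re-key by position so that positive and negative recipes
    # sharing an ID (as in the sample data) do not overwrite each other.
    all_texts = list(p_recipes_tokenized.values()) + list(n_recipes_tokenized.values())
    all_recipes_tokenized = dict(enumerate(all_texts))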
    
    v = tfidf_vectorizer(all_recipes_tokenized)

    p_recipe_ids, p_recipes_vectorized = vectorize_recipes(p_recipes_tokenized, v)
    n_recipe_ids, n_recipes_vectorized = vectorize_recipes(n_recipes_tokenized, v)

    p_labels = np.zeros(p_recipes_vectorized.shape[0])  # 0: the target recipe
    n_labels = np.ones(n_recipes_vectorized.shape[0])   # 1: not the target recipe

    recipe_ids = np.concatenate([p_recipe_ids, n_recipe_ids])
    y = np.concatenate([p_labels, n_labels])
    X = vstack([p_recipes_vectorized, n_recipes_vectorized])

    return (recipe_ids, X.toarray(), y)

## 2. Construct a Random Forest model

For starters, we build a Random Forest ( https://en.wikipedia.org/wiki/Random_forest ) model.

Random Forest is an ensemble of decision trees; it is powerful and stable, which makes it a good first choice.

### Prepare your data and set the target recipe.

You can use the "sample" label if you don't have any data; however, the result is NOT meaningful because the dataset is too small.

In [None]:
label = ""

In [None]:
pos, neg = recipes_for_label(label)

Generate data for modeling.

In [None]:
%%time

# This could take minutes
recipe_ids, X, y = generate_train_data(label)

In [None]:
recipe_ids[0:5]

In [None]:
y[0:5]

Split the dataset into {train, test}; here we hold out 20% of the data as the test set.

Fixing random_state makes the split reproducible.

In [None]:
recipe_ids_train, recipe_ids_test, X_train, X_test, y_train, y_test = train_test_split(
    recipe_ids, X, y, test_size=0.2, random_state=1729
)
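
To make the role of `test_size` and `random_state` concrete, here is a simplified stand-in for what a seeded shuffle-and-split does. It is not scikit-learn's implementation; `simple_split` is a hypothetical helper that only illustrates that the same seed always yields the same partition.

```python
import random

def simple_split(items, test_size, seed):
    """Shuffle indices with a fixed seed and hold out the first `test_size` fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # seeded, hence reproducible
    n_test = int(round(len(items) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

ids = list(range(100, 110))
train_a, test_a = simple_split(ids, test_size=0.2, seed=1729)
train_b, test_b = simple_split(ids, test_size=0.2, seed=1729)

print(len(train_a), len(test_a))
print(test_a == test_b)
```

Because the recipe IDs, features and labels are split together in the cell above, row i of each output still refers to the same recipe.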

Check the data.

In [None]:
print(len(y_train), sum(y_train))
print(len(y_test), sum(y_test))

Train a model.

Although RandomForestClassifier has various parameters ( http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html ), we use the defaults except for n_estimators, which is the number of trees.

In [None]:
%%time
model = RandomForestClassifier(random_state=1729, n_estimators=50)
model.fit(X_train, y_train)

Check the model performance.

In [None]:
y_pred_test = model.predict(X_test)
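# The target recipe is encoded as 0, so score label 0 as the positive class
# (scikit-learn defaults to pos_label=1).
print(precision_score(y_test, y_pred_test, pos_label=0))
print(recall_score(y_test, y_pred_test, pos_label=0))
print(f1_score(y_test, y_pred_test, pos_label=0))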
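
Since this notebook encodes the target recipe as 0, it matters which label the metrics treat as positive. The sketch below computes precision, recall and F1 by hand for a chosen positive label on made-up predictions (the arrays are illustrative, not results from this notebook). In scikit-learn, the same choice is expressed with the `pos_label` argument, whose default is 1.

```python
def prf(y_true, y_pred, pos_label):
    """Precision, recall and F1 when `pos_label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 0 = target recipe, 1 = not the target recipe.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

scores_target = prf(y_true, y_pred, pos_label=0)
scores_other = prf(y_true, y_pred, pos_label=1)
print(scores_target)
print(scores_other)
```

With imbalanced data, the two views can differ a lot, so it is worth stating which class the reported numbers refer to.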

Check the wrong predictions. **NOTE that we define 0: positive, 1: negative.** The examples below assume the target category is sweets.

Answer: sweets <br>
Prediction: not sweets

In [None]:
wrong_recipe_ids = np.where( ( y_test == 0 ) & (y_pred_test  == 1 ) )[0]

In [None]:
wrong_recipe_ids

In [None]:
recipe_ids_test[wrong_recipe_ids[0]]

In [None]:
pos[recipe_ids_test[wrong_recipe_ids[0]]]

Answer: not sweets <br>
Prediction: sweets

In [None]:
wrong_recipe_ids = np.where( ( y_test == 1 ) & (y_pred_test  == 0 ) )[0]

In [None]:
wrong_recipe_ids

In [None]:
recipe_ids_test[wrong_recipe_ids[0]]

In [None]:
neg[recipe_ids_test[wrong_recipe_ids[0]]]

## 3. Construct an XGBoost model

Let's try XGBoost ( paper: https://arxiv.org/abs/1603.02754 , GitHub: https://github.com/dmlc/xgboost ), which is widely regarded as one of the most effective models for structured data, especially on Kaggle ( https://www.kaggle.com/ ).
More recently, LightGBM ( GitHub: https://github.com/Microsoft/LightGBM ) is also said to be powerful.

XGBoost is similar to Random Forest in that it is an ensemble method (in most cases the weak learners are decision trees).

Random Forest trains its trees independently and simply averages their results. XGBoost (more precisely, gradient boosting) instead adds trees one at a time, fitting each new tree using the derivative of an objective function with respect to the current predictions, so that the objective decreases step by step.
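
To illustrate the difference, the following is a toy gradient boosting loop for squared error on one-dimensional data, written from scratch with depth-1 trees (stumps). It is a teaching sketch, not how XGBoost is implemented (XGBoost additionally uses second-order gradients, regularization and many optimizations); all function names and data here are invented for the example.

```python
def fit_stump(xs, residuals):
    """Find the threshold split of xs that best fits residuals with two constants."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees, learning_rate):
    """Add stumps one at a time; each is fit to the current residuals,
    which are the negative gradient of the squared error."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

one_tree = boost(xs, ys, n_trees=1, learning_rate=0.5)
many_trees = boost(xs, ys, n_trees=50, learning_rate=0.5)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

Each stump depends on the errors left by the previous ones, whereas the trees of a Random Forest could be trained in any order.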

For structured datasets, **DON'T THINK, USE XGBoost (or LightGBM)**.

Here we use xgboost.sklearn.XGBClassifier; fortunately, it offers a sklearn-like API.

In [None]:
from xgboost.sklearn import XGBClassifier

Set the XGBoost parameters ( http://xgboost.readthedocs.io/en/latest/python/python_api.html ).

In [None]:
xgb = XGBClassifier(
        objective= 'binary:logistic'
        , learning_rate =0.4
        , n_estimators=50
        , max_depth=10
        , min_child_weight=1.2
        , gamma=0.4
        , subsample=0.7
        , colsample_bytree=0.7
        , scale_pos_weight=1
        , seed=1729
)

Train a model.

In [None]:
%%time
model = xgb.fit(X_train, y_train)

Check the model performance.

In [None]:
pred_test = model.predict(X_test)

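# The target recipe is encoded as 0, so score label 0 as the positive class.
print(precision_score(y_test, pred_test, pos_label=0))
print(recall_score(y_test, pred_test, pos_label=0))
print(f1_score(y_test, pred_test, pos_label=0))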

### (Additional task) Try to find good parameters.

Many machine learning models have hyperparameters: parameters that must be set by a human rather than learned during training.

Because model performance depends heavily on these parameters, we need to tune them to get better results.

Unfortunately, performance depends on the hyperparameters in a highly non-linear way, so in practice we resort to search strategies such as grid search or Bayesian optimization.

Here we use grid search to find better parameters.
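
Conceptually, grid search just evaluates every combination of candidate values and keeps the best one. The sketch below shows that loop with `itertools.product`; the `toy_score` function is a made-up stand-in for a cross-validated model score, so the numbers carry no meaning beyond illustration.

```python
from itertools import product

param_grid = {
    "learning_rate": [0.2, 0.3, 0.4],
    "n_estimators": [125, 150, 175],
    "max_depth": [10, 15, 20],
}

def toy_score(learning_rate, n_estimators, max_depth):
    """Made-up score that peaks at (0.3, 150, 15); stands in for cross-validation."""
    return -((learning_rate - 0.3) ** 2
             + ((n_estimators - 150) / 100) ** 2
             + ((max_depth - 15) / 10) ** 2)

names = list(param_grid)
# Every combination of one value per parameter.
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

best_params = max(candidates, key=lambda params: toy_score(**params))
print(len(candidates), best_params)
```

The number of combinations grows multiplicatively: three values for each of three parameters already gives 27 candidates, and with 5-fold cross-validation that means 135 model fits.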

In [None]:
# from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in newer scikit-learn

**!! Skip this cell because it will take too much time !!**

In [None]:
# %%time

# param_test = {
#     'learning_rate': [0.2,0.3,0.4]
#     , 'n_estimators': [125,150,175]
#     , 'max_depth': [10,15,20]
# }
# gsearch = GridSearchCV(estimator = XGBClassifier( 
#     min_child_weight=1
#     , gamma=0
#     , subsample=0.8
#     , colsample_bytree=0.8
#     , objective= 'binary:logistic'
#     , nthread=4
#     , scale_pos_weight=1
#     , seed=1729
#     ), 
#     param_grid = param_test, scoring='roc_auc', n_jobs=1, cv=5  # the iid argument no longer exists in recent scikit-learn
# )

# gsearch.fit(X_train, y_train)

In [None]:
# gsearch.best_params_

### Use the original interface.

In [None]:
# %%time

# import xgboost as xgb

# dtrain = xgb.DMatrix(X_train, label=y_train) # DMatrix is a kind of data structure

# params = {
#         'objective': 'binary:logistic'
#         , 'eta' : 0.2
#         , 'max_depth': 10
#         , 'min_child_weight': 1.2
#         , 'gamma': 0.4
#         , 'subsample': 0.7
#         , 'colsample_bytree': 0.7
#         , 'lambda': 1
#         , 'alpha': 0
#         , 'seed': 1729
# }

# model = xgb.train(dtrain=dtrain, params=params, num_boost_round=175)

# dtest = xgb.DMatrix(X_test)

# prediction_raw = model.predict(dtest) # The model predicts the probability of label 1 as a value in [0, 1].

# thld = 0.5
# pred_test = [1 if elem > thld else 0 for elem in prediction_raw]

# print(precision_score(y_test, pred_test, pos_label=0))
# print(recall_score(y_test, pred_test, pos_label=0))
# print(f1_score(y_test, pred_test, pos_label=0))

## 4. How can we improve the model?

- Data size is the most important factor
- Data cleansing
- Modification of the model
- Rethinking problem settings
- ...

**It requires your creativity!!!**

## Questions

1. Please describe your ideas to improve the model performance.
2. Please implement your ideas and check the result.
3. Can you explain how morphological analysis works?
4. Can you explain tf-idf in detail?
5. Can you explain what an ensemble method is?
6. Can you explain the gradient boosting algorithm? <br>
   (especially how it differs from Random Forest)
7. Can you explain the meanings of the parameters of Random Forest or XGBoost?