# NER - From text to recipe
After adding OCR functionality to the app, I needed a way to extract recipes from the recognized text. After some research, I decided to use NER (Named Entity Recognition) with TensorFlow.


## Dataset
The dataset used is TASTEset, cloned from its GitHub repository.


In [1]:
!git clone https://github.com/taisti/TASTEset

Cloning into 'TASTEset'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 41 (delta 8), reused 3 (delta 3), pack-reused 29
Receiving objects: 100% (41/41), 209.16 KiB | 2.90 MiB/s, done.
Resolving deltas: 100% (14/14), done.


## Analyzing the data

In [2]:
import pandas as pd
data_path = "TASTEset/data/TASTEset.csv"
data = pd.read_csv(data_path, encoding='utf8')

In [3]:
data.head()

Unnamed: 0,ingredients,ingredients_entities
0,5 ounces rum\n4 ounces triple sec\n3 ounces Ti...,"[{""start"": 0, ""end"": 1, ""type"": ""QUANTITY"", ""e..."
1,"2 tubes cinnamon roll, refrigerated, with icin...","[{""start"": 0, ""end"": 1, ""type"": ""QUANTITY"", ""e..."
2,4 ripe coconuts\n1 cup evaporated milk\n1 cup ...,"[{""start"": 0, ""end"": 1, ""type"": ""QUANTITY"", ""e..."
3,1 sheet graham cracker (broken in half)\n2 pie...,"[{""start"": 0, ""end"": 1, ""type"": ""QUANTITY"", ""e..."
4,1 (8 ounce) package crescent rolls\n8 slices d...,"[{""start"": 0, ""end"": 1, ""type"": ""QUANTITY"", ""e..."


In [4]:
data["ingredients_entities"][0]

'[{"start": 0, "end": 1, "type": "QUANTITY", "entity": "5"},{"start": 2, "end": 8, "type": "UNIT", "entity": "ounces"},{"start": 9, "end": 12, "type": "FOOD", "entity": "rum"},{"start": 13, "end": 14, "type": "QUANTITY", "entity": "4"},{"start": 15, "end": 21, "type": "UNIT", "entity": "ounces"},{"start": 22, "end": 32, "type": "FOOD", "entity": "triple sec"},{"start": 33, "end": 34, "type": "QUANTITY", "entity": "3"},{"start": 35, "end": 41, "type": "UNIT", "entity": "ounces"},{"start": 42, "end": 51, "type": "FOOD", "entity": "Tia Maria"},{"start": 52, "end": 54, "type": "QUANTITY", "entity": "20"},{"start": 55, "end": 61, "type": "UNIT", "entity": "ounces"},{"start": 62, "end": 74, "type": "FOOD", "entity": "orange juice"}]'

The labels are stored as a JSON string in which each entity records its type, its text, and its character offsets (`start` inclusive, `end` exclusive) within the ingredient text.

In [8]:
import json
ingredients_entities = json.loads(data.at[0, "ingredients_entities"])

In [9]:
ingredients_entities

[{'start': 0, 'end': 1, 'type': 'QUANTITY', 'entity': '5'},
 {'start': 2, 'end': 8, 'type': 'UNIT', 'entity': 'ounces'},
 {'start': 9, 'end': 12, 'type': 'FOOD', 'entity': 'rum'},
 {'start': 13, 'end': 14, 'type': 'QUANTITY', 'entity': '4'},
 {'start': 15, 'end': 21, 'type': 'UNIT', 'entity': 'ounces'},
 {'start': 22, 'end': 32, 'type': 'FOOD', 'entity': 'triple sec'},
 {'start': 33, 'end': 34, 'type': 'QUANTITY', 'entity': '3'},
 {'start': 35, 'end': 41, 'type': 'UNIT', 'entity': 'ounces'},
 {'start': 42, 'end': 51, 'type': 'FOOD', 'entity': 'Tia Maria'},
 {'start': 52, 'end': 54, 'type': 'QUANTITY', 'entity': '20'},
 {'start': 55, 'end': 61, 'type': 'UNIT', 'entity': 'ounces'},
 {'start': 62, 'end': 74, 'type': 'FOOD', 'entity': 'orange juice'}]
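To make the offsets concrete, here is a minimal sketch that slices the text with each entity's `start` and `end` and checks that the result matches the stored `entity` string. The `text` value is reconstructed from the first three lines of the first recipe, and only three of its entities are included, so treat it as an illustration rather than the full row.

```python
import json

# First three lines of the first recipe, reconstructed from the entity offsets above
text = "5 ounces rum\n4 ounces triple sec\n3 ounces Tia Maria"

# A subset of the entities shown above, in the dataset's JSON format
entities = json.loads(
    '[{"start": 0, "end": 1, "type": "QUANTITY", "entity": "5"},'
    '{"start": 22, "end": 32, "type": "FOOD", "entity": "triple sec"},'
    '{"start": 42, "end": 51, "type": "FOOD", "entity": "Tia Maria"}]'
)

for ent in entities:
    # `end` is exclusive, so a plain Python slice recovers the entity text
    span = text[ent["start"]:ent["end"]]
    assert span == ent["entity"]
    print(ent["type"], repr(span))
```

Note that the offsets count the newline characters too, which is why the line breaks must not be removed (only replaced one-for-one) before the offsets are used.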

I need them as an array in which each position holds the tag of the corresponding word in the sentence, with "O" for words that belong to no entity.<br>
E.g.: "5 grams of sugar" -> ["QUANTITY", "UNIT", "O", "FOOD"]
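As a compact illustration of that target format, the sketch below converts character spans into one tag per word by checking which entity, if any, overlaps each word. The helper name `spans_to_tags` and the example entities are hypothetical, and this overlap-based approach is an alternative to the asterisk-masking conversion used in this notebook, not a copy of it. It also splits on any whitespace, whereas the notebook splits on single spaces.

```python
import re

def spans_to_tags(text, entities):
    """Return (words, tags) with one tag per whitespace-separated word."""
    words, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default for words outside every entity
        for ent in entities:
            # a word takes an entity's type if their character ranges overlap
            if ent["start"] < match.end() and match.start() < ent["end"]:
                tag = ent["type"]
                break
        words.append(match.group())
        tags.append(tag)
    return words, tags

# Hypothetical annotation for the example sentence, in the dataset's format
text = "5 grams of sugar"
entities = [
    {"start": 0, "end": 1, "type": "QUANTITY", "entity": "5"},
    {"start": 2, "end": 7, "type": "UNIT", "entity": "grams"},
    {"start": 11, "end": 16, "type": "FOOD", "entity": "sugar"},
]

words, tags = spans_to_tags(text, entities)
print(list(zip(words, tags)))
```

A multi-word entity such as "triple sec" overlaps two words, so its tag is repeated once per word, which is the same convention the conversion in this notebook follows.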

In [10]:
labels = []
sentences = []

for j in range(0,len(data)):
  # getting the entities and the sentences
  ingredients_entities = json.loads(data.at[j, "ingredients_entities"])
  chars = list(data["ingredients"][j].replace("\n"," "))

  # replacing every character with asterisks
  for x in range(0,len(chars)):
    if chars[x]!= " " and chars[x]!="\n":
      chars[x]= "*"

  previous = list(data["ingredients"][j])

  while len(ingredients_entities)>0:
    word = chars[ingredients_entities[-1]["start"]:ingredients_entities[-1]["end"]]
    # if an entity has more than one word I want to repeat the tag
    replace_count = len("".join(word).split(" "))
    replace = ingredients_entities[-1]["type"]
    for i in range(1,replace_count):
      replace = replace + " " +ingredients_entities[-1]["type"]
    # now I replace the asterisk corresponding to that word with the tags
    chars[ingredients_entities[-1]["start"]:ingredients_entities[-1]["end"]] = replace
    last_index = len(ingredients_entities) - 1
    # I am done with that entity
    ingredients_entities.pop(last_index)
  # splitting on spaces and dropping empty tokens (from repeated or trailing spaces);
  # filtering here avoids popping from the list while iterating over it
  tags = [t for t in "".join(chars).split(" ") if t]
  for w in range(0,len(tags)):
    # a token still made only of asterisks belongs to no entity: tag it "O"
    if set(tags[w]) == {"*"}:
      tags[w] = "O"

  # removing leftover asterisks (e.g. punctuation attached to a tagged word)
  tagsArr = [t.replace("*","") for t in tags]
  # tokenizing the original sentence the same way so words and tags stay aligned
  sentenceArr = [s for s in "".join(previous).replace("\n"," ").split(" ") if s]
  #print(str(len(tagsArr))+"=="+str(len(sentenceArr)))

  # finally I can add sentences and label to their respective array
  sentences.append(" ".join(sentenceArr))
  labels.append(tagsArr)

df = pd.DataFrame({'sentences':sentences, 'labels':labels})
df.head()



Unnamed: 0,sentences,labels
0,5 ounces rum 4 ounces triple sec 3 ounces Tia ...,"[QUANTITY, UNIT, FOOD, QUANTITY, UNIT, FOOD, F..."
1,"2 tubes cinnamon roll, refrigerated, with icin...","[QUANTITY, UNIT, FOOD, FOOD, PROCESS, FOOD, FO..."
2,4 ripe coconuts 1 cup evaporated milk 1 cup gi...,"[QUANTITY, PHYSICAL_QUALITY, FOOD, QUANTITY, U..."
3,1 sheet graham cracker (broken in half) 2 piec...,"[QUANTITY, UNIT, FOOD, FOOD, PROCESS, PROCESS,..."
4,1 (8 ounce) package crescent rolls 8 slices de...,"[QUANTITY, QUANTITY, UNIT, UNIT, FOOD, FOOD, Q..."
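Since the model needs exactly one label per word, it is worth checking the alignment before training. The sketch below runs that check on a tiny stand-in for `df` (two shortened rows written by hand for illustration); the same three lines at the end can be run against the real DataFrame.

```python
import pandas as pd

# Tiny stand-in for the `df` built above (rows shortened for illustration)
df = pd.DataFrame({
    "sentences": ["5 ounces rum 4 ounces triple sec", "4 ripe coconuts"],
    "labels": [
        ["QUANTITY", "UNIT", "FOOD", "QUANTITY", "UNIT", "FOOD", "FOOD"],
        ["QUANTITY", "PHYSICAL_QUALITY", "FOOD"],
    ],
})

# Number of space-separated words and number of labels in each row
word_counts = df["sentences"].map(lambda s: len(s.split(" ")))
label_counts = df["labels"].map(len)

# Rows where the two counts differ would break the word/tag alignment
mismatched = df[word_counts != label_counts]
print(f"{len(mismatched)} of {len(df)} rows are misaligned")
```

Any row that shows up in `mismatched` should be inspected or dropped, because a shifted label sequence would teach the model wrong tags for every word after the shift.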
