# AIDock - AI assignment - Yaniv Alon

## Table of Contents:

* [Assignment description](#description)
    * [Description of the data](#data)
* [Data Reading and Data Cleaning](#read)
* [Prepare data for NLP](#Prepare)
* [Building the model](#models)

## Assignment description <a class="anchor" id="description"></a>

The goal: given a string containing the description of a recipe, return a JSON object containing the ingredients and the recipe.

You are advised to treat this as a classification problem that determines, for each paragraph, the probability that its label is ‘ingredients’ or ‘recipe’.
The implementation will be done in Python 3.6 or higher. Python packages in this assignment: PyTorch, TensorFlow, pandas, NumPy and BeautifulSoup.
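
To make the target output concrete, here is a minimal sketch of how per-paragraph probabilities could be turned into the requested JSON. The function name `build_recipe_json`, the `threshold` parameter and the example probabilities are hypothetical; the sketch assumes the label convention used later in this notebook (0 = ingredients, 1 = instructions), so each probability is read as "this paragraph is an instruction".

```python
import json

def build_recipe_json(paragraphs, instruction_probs, threshold=0.5):
    """Group paragraphs into 'ingredients' and 'recipe' by predicted probability.

    instruction_probs[i] is the (hypothetical) model probability that
    paragraphs[i] is a recipe instruction; 1 - p is the probability
    that it is an ingredients paragraph.
    """
    if len(paragraphs) != len(instruction_probs):
        raise ValueError("paragraphs and instruction_probs must have the same length")
    result = {"ingredients": [], "recipe": []}
    for text, prob in zip(paragraphs, instruction_probs):
        key = "recipe" if prob >= threshold else "ingredients"
        result[key].append(text)
    # ensure_ascii=False keeps characters such as ½ or ° readable in the output
    return json.dumps(result, ensure_ascii=False)

example = build_recipe_json(
    ["2 cups diced red bell pepper", "Preheat the oven to 425°F."],
    [0.02, 0.97],  # made-up probabilities for illustration only
)
print(example)
```

The 0.5 threshold is only a default for the sketch; a different cut-off may suit the data better.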


##### Package versions used for this assignment

pandas 1.5.2

numpy 1.20.3

tensorflow  2.9.1

psutil  5.8.0

requests 2.26.0

beautifulsoup4  4.10.0

python 3.9.7 
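
One way to compare a local environment against the versions listed above is sketched below, using only the standard library. The helper names `installed_versions` and `version_mismatches` are hypothetical, and the sketch assumes Python 3.8 or newer, where `importlib.metadata` is available.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above
EXPECTED = {
    "pandas": "1.5.2",
    "numpy": "1.20.3",
    "tensorflow": "2.9.1",
    "psutil": "5.8.0",
    "requests": "2.26.0",
    "beautifulsoup4": "4.10.0",
}

def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

def version_mismatches(expected, installed):
    """Return {package: (expected, installed)} for every package that differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in expected.items()
        if installed.get(name) != wanted
    }

mismatches = version_mismatches(EXPECTED, installed_versions(EXPECTED))
for name, (wanted, got) in mismatches.items():
    print(f"{name}: expected {wanted}, found {got}")
```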

#### Description of the data  <a class="anchor" id="data"></a>

The dataset consists of 3 features: URL, ingredients, and instructions, which were scraped from a website for model training.

## Data Reading and Data Cleaning<a class="anchor" id="read"></a>

In [1]:
#import libraries 
import pandas as pd
import numpy as np
import tensorflow as tf
import warnings
import json
warnings.filterwarnings('ignore')


In [2]:
df =pd.read_csv('loaveandlemons_dataset.csv')
df.head()

| | Unnamed: 0 | url | ingredients | instructions |
|---|---|---|---|---|
| 0 | 0 | https://www.loveandlemons.com/vegan-ramen/ | 1 recipe Mushroom Broth\n1 tablespoon rice vin... | Prepare the mushroom broth according to this r... |
| 1 | 1 | https://www.loveandlemons.com/mushroom-broth/ | 2 medium yellow onions\n2 tablespoons extra-vi... | Wash and dry the onions, then remove the onion... |
| 2 | 2 | https://www.loveandlemons.com/broccolini/ | 2 tablespoons extra-virgin olive oil\n3 garlic... | Heat the olive oil in a large lidded skillet o... |
| 3 | 3 | https://www.loveandlemons.com/pasta-fagioli/ | 2 tablespoons extra-virgin olive oil, plus mor... | Heat the olive oil in a large pot or Dutch ove... |
| 4 | 4 | https://www.loveandlemons.com/vegan-meatballs/ | 16 ounces mixed cremini and shiitake mushrooms... | Preheat the oven to 425°F and line a baking sh... |


In [3]:
#checking for missing values
report = df.isna().sum().to_frame()
report = report.rename(columns = {0: 'missing_values'})
report['% of total'] = (report['missing_values'] / df.shape[0]).round(2)
report.sort_values(by = 'missing_values', ascending = False)

| | missing_values | % of total |
|---|---|---|
| ingredients | 9 | 0.01 |
| instructions | 9 | 0.01 |
| Unnamed: 0 | 0 | 0.0 |
| url | 0 | 0.0 |


<div  style="border: solid black 2px; padding: 20px"> <b> Note:</b>
There are 9 missing values in each of the ingredients and instructions columns to deal with.
</div>

In [4]:
df[df['ingredients'].isna()]

| | Unnamed: 0 | url | ingredients | instructions |
|---|---|---|---|---|
| 745 | 745 | https://www.loveandlemons.com/how-to-make-matcha/ | NaN | NaN |
| 948 | 948 | https://www.loveandlemons.com/vegetarian-memor... | NaN | NaN |
| 981 | 981 | https://www.loveandlemons.com/peanut-butter-go... | NaN | NaN |
| 1030 | 1030 | https://www.loveandlemons.com/peach-crumble-su... | NaN | NaN |
| 1172 | 1172 | https://www.loveandlemons.com/mango-fiesta/ | NaN | NaN |
| 1244 | 1244 | https://www.loveandlemons.com/ottolenghis-tuna... | NaN | NaN |
| 1271 | 1271 | https://www.loveandlemons.com/grilled-veggie-s... | NaN | NaN |
| 1296 | 1296 | https://www.loveandlemons.com/basil-white-bean... | NaN | NaN |
| 1307 | 1307 | https://www.loveandlemons.com/avocado-chickpea... | NaN | NaN |


<div  style="border: solid black 2px; padding: 20px"> <b> Note:</b>
After checking the URLs, it seems the scraper didn't capture the ingredients and instructions from these pages.
</div>

In [5]:
# dropping the rows with missing values
df2 = df.copy().dropna()

In [6]:
#checking for missing values
report = df2.isna().sum().to_frame()
report = report.rename(columns = {0: 'missing_values'})
report['% of total'] = (report['missing_values'] / df2.shape[0]).round(2)
report.sort_values(by = 'missing_values', ascending = False)

| | missing_values | % of total |
|---|---|---|
| Unnamed: 0 | 0 | 0.0 |
| url | 0 | 0.0 |
| ingredients | 0 | 0.0 |
| instructions | 0 | 0.0 |


## Prepare data for NLP<a class="anchor" id="Prepare"></a>


In [7]:
# splitting the data into training and validation sets
train_size=round(df2.shape[0]*0.8)
train = df2[:train_size]
val=df2[train_size:]
print('train shape: ', train.shape[0])
print('val shape: ', val.shape[0])

train shape:  1043
val shape:  261
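
The split above takes the first 80% of rows in file order. Whether row order matters for this dataset is not known, but if it does, a shuffled split is a common alternative. The sketch below is one hedged way to do it with the standard library only; `split_indices`, its `seed` default and the row count 1304 (1043 + 261, from the output above) are illustrative assumptions.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train_idx, val_idx): shuffled, non-overlapping row positions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    train_size = round(n_rows * train_fraction)
    return indices[:train_size], indices[train_size:]

train_idx, val_idx = split_indices(1304)  # 1304 = 1043 + 261 rows after dropna
print(len(train_idx), len(val_idx))
```

The positions could then be applied with `df2.iloc[train_idx]` and `df2.iloc[val_idx]`.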


In [8]:

def preprocess_and_labels(data_set):
    """Prepare a data set for machine learning.

    Stacks the ingredients and instructions columns into a single text
    feature and adds a label column:
      ingredients are labeled 0
      instructions are labeled 1
    """
    # .copy() avoids pandas' SettingWithCopyWarning when adding the label column
    ingredients = data_set[['ingredients']].copy()
    ingredients['label'] = 0
    ingredients.columns = ['ingredients_instructions', 'label']

    instructions = data_set[['instructions']].copy()
    instructions['label'] = 1
    instructions.columns = ['ingredients_instructions', 'label']

    new_data_set = pd.concat([ingredients, instructions], axis=0)
    new_data_set = new_data_set.reset_index(drop=True)
    return new_data_set

In [9]:
#making new data set using the preprocess_and_labels function
train_pre=preprocess_and_labels(train)
val_pre=preprocess_and_labels(val)
print('train_pre shape: ', train_pre.shape[0])
print('val_pre shape: ', val_pre.shape[0])

train_pre shape:  2086
val_pre shape:  522


In [10]:
train_pre['ingredients_instructions']

0       1 recipe Mushroom Broth\n1 tablespoon rice vin...
1       2 medium yellow onions\n2 tablespoons extra-vi...
2       2 tablespoons extra-virgin olive oil\n3 garlic...
3       2 tablespoons extra-virgin olive oil, plus mor...
4       16 ounces mixed cremini and shiitake mushrooms...
                              ...                        
2081    Preheat oven to 350 degrees.\nSlice pitas into...
2082    Toss tomatoes in a small bowl with olive oil, ...
2083    Preheat your oven to 350 degrees F, with an 8 ...
2084    Drain your soaked cashews and rinse them. In a...
2085    Blend all ingredients until smooth. For a crea...
Name: ingredients_instructions, Length: 2086, dtype: object

In [11]:
# tokenizer for preprocessing
tokenizer = tf.keras.preprocessing.text.Tokenizer()

In [12]:
#fitting on the training set
tokenizer.fit_on_texts(train_pre['ingredients_instructions'])
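
To illustrate what fitting does, the sketch below imitates the Keras `Tokenizer` defaults in plain Python: lowercase the text, replace the default filter characters with spaces, split on spaces, and number words by frequency starting at 1. It is a simplified approximation rather than the Keras implementation, and the helper names `split_words`, `build_word_index` and `to_sequence` are hypothetical.

```python
from collections import Counter

# Characters the Keras Tokenizer filters out by default (replaced by spaces here)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on spaces."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return [word for word in text.lower().translate(table).split(" ") if word]

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in split_words(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Convert text to a list of indices, skipping words not in word_index."""
    return [word_index[w] for w in split_words(text) if w in word_index]

texts = ["1 garlic clove, grated", "9 large eggs", "1 tablespoon olive oil"]
word_index = build_word_index(texts)
print(word_index["1"])                                 # most frequent word gets index 1
print(to_sequence("1 large garlic clove", word_index))
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.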

In [13]:
# saving the tokenizer for testing
tokenizer_json = tokenizer.to_json()
with open("tokenizer_json.json", "w") as outfile:
    json.dump(tokenizer_json, outfile)
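
Because `tokenizer.to_json()` already returns a JSON-formatted string, `json.dump` stores it as a JSON string literal, and `json.load` gives back that same string rather than a dictionary. The sketch below demonstrates the round trip with a stand-in string so it runs without TensorFlow; the stand-in content and the temporary path are assumptions for illustration.

```python
import json
import os
import tempfile

# Stand-in for tokenizer.to_json(): any JSON-formatted string (hypothetical content)
tokenizer_json = json.dumps({"class_name": "Tokenizer", "config": {}})

path = os.path.join(tempfile.mkdtemp(), "tokenizer_json.json")
with open(path, "w") as outfile:
    json.dump(tokenizer_json, outfile)  # writes the string as a JSON string literal

with open(path) as infile:
    restored = json.load(infile)        # back to the original str, not a dict

print(type(restored).__name__)
print(restored == tokenizer_json)
# With TensorFlow available, the restored string would then be passed on:
# tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(restored)
```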

In [14]:
# converting the words to sequences using the tokenizer
train_sequences = tokenizer.texts_to_sequences(train_pre['ingredients_instructions'])
val_sequences = tokenizer.texts_to_sequences(val_pre['ingredients_instructions'])

In [15]:
# show an example
print(train_pre['ingredients_instructions'][10])
print(train_sequences[10])

Extra-virgin olive oil, for the pan
2 cups diced red bell pepper, about 2 medium
½ cup chopped scallions
9 large eggs
1 garlic clove, grated
Heaping ½ teaspoon sea salt
Freshly ground black pepper
3 tablespoons all-purpose flour
¾ teaspoon baking powder
⅓ cup crumbled feta cheese
[48, 58, 20, 14, 10, 1, 101, 12, 41, 112, 54, 493, 16, 59, 12, 37, 18, 9, 36, 189, 635, 38, 180, 5, 23, 163, 178, 332, 18, 17, 27, 11, 55, 46, 44, 16, 40, 25, 184, 337, 86, 183, 17, 28, 83, 114, 9, 419, 255, 65]


In [16]:
#padding the sequences
def make_pad(sequences,max_length):
    # pads or truncates each sequence at the end so that all have length max_length
    data_pad = tf.keras.utils.pad_sequences(sequences, maxlen=max_length,padding="post",truncating="post")
    return data_pad
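
The effect of `padding="post"` and `truncating="post"` can be shown on a single sequence in plain Python. This is a sketch of the behavior, not the Keras implementation, and `pad_post` is a hypothetical helper.

```python
def pad_post(sequence, max_length, value=0):
    """Truncate from the end, then pad with `value` at the end."""
    trimmed = list(sequence[:max_length])
    return trimmed + [value] * (max_length - len(trimmed))

print(pad_post([48, 58, 20], 5))              # shorter: zeros are appended
print(pad_post([48, 58, 20, 14, 10, 1], 5))   # longer: the tail is cut off
```

With post-truncation, any tokens beyond `max_length` are dropped, so a `max_length` that is too small loses the end of long texts.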

In [17]:
# looking up an entry to choose max_length
# note: .max() on a text column returns the string that sorts last, not necessarily the longest one
print("the longest sequence: ", train_pre[train_pre['ingredients_instructions'] == train_pre['ingredients_instructions'].max()])

the longest sequence:                                ingredients_instructions  label
546  ⅔ cup whole rolled oats\n½ cup pecans, plus mo...      0


In [18]:
print("the max length is: ", len(train_sequences[546]))

the max length is:  55
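
Note that `Series.max()` on a text column compares strings lexicographically, so the row it selects is the one that sorts last, which is not necessarily the longest. A hedged sketch of measuring the longest tokenized sequence directly is shown below; `longest_sequence` and the toy data are hypothetical.

```python
def longest_sequence(sequences):
    """Return (position, length) of the longest sequence; the first one wins ties."""
    position = max(range(len(sequences)), key=lambda i: len(sequences[i]))
    return position, len(sequences[position])

toy_sequences = [[5, 2], [7, 1, 9, 4], [3]]
print(longest_sequence(toy_sequences))

# String comparison is lexicographic, so max() here does not pick the longest text:
print(max(["⅔ cup oats", "Preheat the oven to 350 degrees and line a tray"]))
```

In the notebook this would be called as `longest_sequence(train_sequences)`. The result may well exceed 55, in which case longer sequences are truncated by the padding step.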


In [19]:
# padding the sequences with the make_pad function
# max_length is 55, the sequence length found above; longer sequences are truncated
train_padded = make_pad(train_sequences,55)
val_padded = make_pad(val_sequences,55)
print('train_padded shape: ', train_padded.shape[0])
print('val_padded shape: ', val_padded.shape[0])

train_padded shape:  2086
val_padded shape:  522


## Building the model<a class="anchor" id="models"></a>

In [20]:
model = tf.keras.models.Sequential()
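# note: Tokenizer indices start at 1 (0 is used for padding), so input_dim is normally len(tokenizer.word_index) + 1
model.add(tf.keras.layers.Embedding(len(tokenizer.word_index), 32, input_length=55))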

model.add(tf.keras.layers.LSTM(64, dropout=0.1))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 55, 32)            108864    
                                                                 
 lstm (LSTM)                 (None, 64)                24832     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 133,761
Trainable params: 133,761
Non-trainable params: 0
_________________________________________________________________
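
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are hypothetical, and the vocabulary size is inferred from the summary (108864 / 32) rather than read from the tokenizer.

```python
def embedding_params(vocab_size, embedding_dim):
    # one embedding vector per vocabulary row
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 108864 // 32  # implied by the summary: 3402 rows in the embedding
total = embedding_params(vocab_size, 32) + lstm_params(32, 64) + dense_params(64, 1)
print(vocab_size, lstm_params(32, 64), dense_params(64, 1), total)
```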


In [21]:
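loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
# note: optim is never passed to model.compile() below, so Keras uses its default optimizer (RMSprop)
optim = tf.keras.optimizers.legacy.Adam(learning_rate=0.01)
metrics = ["accuracy"]

model.compile(loss=loss, metrics=metrics)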
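
With `from_logits=False`, the loss expects probabilities, which matches the sigmoid output layer. As an illustration of what is being minimized, here is a plain-Python sketch of binary cross-entropy; the function name and the clipping constant `eps` are assumptions of this sketch, not values taken from the notebook.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# label 0 = ingredients, label 1 = instructions
print(round(binary_crossentropy([0, 1], [0.1, 0.9]), 4))
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized heavily.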

In [22]:
model.fit(train_padded,train_pre['label'], epochs=2, validation_data=(val_padded,val_pre['label']), verbose=2)


Epoch 1/2
66/66 - 2s - loss: 0.1304 - accuracy: 0.9593 - val_loss: 0.1073 - val_accuracy: 0.9789 - 2s/epoch - 36ms/step
Epoch 2/2
66/66 - 1s - loss: 0.0074 - accuracy: 0.9995 - val_loss: 0.0236 - val_accuracy: 0.9962 - 827ms/epoch - 13ms/step


<keras.callbacks.History at 0x2092456d520>

In [23]:
model.save("my_model")



INFO:tensorflow:Assets written to: my_model\assets


