### Data Preparation Notebook

In [19]:
import datasets
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
import faiss

#### Loading tokenizer & model

In [11]:
tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-multiset-base')
model = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-multiset-base')

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification mode

#### Loading recipes dataset

In [12]:
dataset = datasets.load_dataset("m3hrdadfi/recipe_nlg_lite")

Repo card metadata block was not found. Setting CardData to empty.


#### Extracting the recipe steps column

In [13]:
recipes = [entry['steps'] for entry in dataset['train']]

#### Encoding recipes into vectors

In [14]:
# keep only the first 100 recipes
recipes = recipes[:100]
len(recipes)

100

In [15]:
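import torch

def encode_recipes(recipes):
    # Tokenize the whole list as one batch: pad to the longest recipe, truncate at 512 tokens
    inputs = tokenizer(recipes, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():  # inference only, so gradients are not tracked
        embeddings = model(**inputs).pooler_output
    return embeddings.numpy()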

In [16]:
embeddings = encode_recipes(recipes)
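
The call above sends all 100 recipes through the encoder as a single batch, padded to the longest recipe, which can use a lot of memory once the slice grows. A minimal sketch of splitting the work into smaller batches follows. The helper names `batched` and `encode_in_batches` and the batch size of 16 are assumptions for illustration, not part of the notebook, and the encoder is passed in as a function so the sketch runs without the model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(texts: Sequence[str],
                      encode_fn: Callable[[Sequence[str]], Sequence],
                      batch_size: int = 16) -> List:
    """Apply `encode_fn` to each batch and collect one vector per text, in order."""
    vectors: List = []
    for batch in batched(texts, batch_size):
        # encode_fn is expected to return one vector per input text
        vectors.extend(encode_fn(batch))
    return vectors
```

With the notebook's own function this would be called as `encode_in_batches(recipes, encode_recipes)`. The result is a list of one-dimensional arrays, so it would need `numpy.stack` to become the two-dimensional array that FAISS expects.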

#### Creating an IndexFlatL2 from the embedded recipes

In [20]:
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
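
`IndexFlatL2` stores the vectors uncompressed and answers a query by comparing it against every stored vector, ranking by squared Euclidean distance. The pure-Python sketch below illustrates that idea on toy data. It is not the FAISS API, and the names `squared_l2` and `flat_l2_search` are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors: Sequence[Sequence[float]],
                   query: Sequence[float],
                   k: int) -> List[Tuple[float, int]]:
    """Exhaustively score every vector; return the k closest as (distance, position)."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy data standing in for the recipe embeddings
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
nearest = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
positions = [pos for _, pos in nearest]  # positions 1 and 0, closest first
```

The positions returned here play the same role as the ids FAISS returns: they are the order in which the vectors were added, which is what ties a search hit back to a recipe.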

#### Saving the index & dataset locally for production

In [21]:
faiss.write_index(index, "recipes_index.idx")
dataset.save_to_disk('dataset')

Saving the dataset (1/1 shards): 100%|████████████████████████████████████| 6118/6118 [00:01<00:00, 3513.67 examples/s]
Saving the dataset (1/1 shards): 100%|████████████████████████████████████| 1080/1080 [00:00<00:00, 3632.58 examples/s]
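
The saved index holds only vectors, so a search hit has to be mapped back to a recipe by position: the embeddings were added in order from the first 100 training rows, so id `i` corresponds to `dataset['train'][i]`. A minimal sketch of that lookup on toy rows follows. The helper name `rows_for_ids` is an assumption, and it skips the `-1` ids that FAISS uses as padding when it finds fewer than `k` results.

```python
from typing import Dict, List, Sequence


def rows_for_ids(rows: Sequence[Dict[str, str]], ids: Sequence[int]) -> List[Dict[str, str]]:
    """Map index positions back to source rows, dropping -1 padding and out-of-range ids."""
    return [rows[i] for i in ids if 0 <= i < len(rows)]


# Toy rows standing in for dataset['train']
toy_rows = [{"steps": "boil water"}, {"steps": "chop onions"}, {"steps": "bake"}]
found = rows_for_ids(toy_rows, [2, 0, -1])
```

On the consuming side, the two artifacts written above can be reloaded with `faiss.read_index("recipes_index.idx")` and `datasets.load_from_disk('dataset')` before applying a lookup like this one.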
