## Recipe Ingredient Analysis Notebook

### Objectives and Approach

This notebook analyzes ingredients from a Spanish recipe dataset, with the following goals:

- **Identify whether a given word or phrase is an ingredient or not**, using a combination of rule-based checks and embedding-based similarity.
- **Detect duplicate ingredients** that appear in different forms (e.g., gender variations like *quemado/quemada*, singular/plural forms) to clean and consolidate the ingredient list.
- **Find substitute or similar ingredients** by leveraging word embeddings, helping with recipe modification or alternative recommendations.
- **Create meaningful categories or clusters of ingredients** to organize them better (e.g., spices, vegetables, dairy).
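
As a first pass at the duplicate-detection goal, here is a minimal sketch of a crude normalization key. It assumes that stripping accents, a trailing plural `s`, and a final `a`/`o`/`e` is enough to collapse most gender and number variants. The names `variant_key` and `group_variants` are hypothetical, and the heuristic will over-merge some unrelated words, so the resulting groups are candidates for review rather than confirmed duplicates.

```python
from collections import defaultdict

# Map accented vowels to plain ones; "ñ" is deliberately left untouched.
_ACCENTS = str.maketrans("áéíóúü", "aeiouu")


def variant_key(word: str) -> str:
    """Crude normalization key (hypothetical helper, not a real stemmer)."""
    key = word.lower().strip().translate(_ACCENTS)
    # Drop a trailing plural "s" (quemados -> quemado, limones -> limone).
    if len(key) > 3 and key.endswith("s"):
        key = key[:-1]
    # Drop a final gender/plural vowel (quemado, quemada -> quemad).
    if len(key) > 3 and key[-1] in "aoe":
        key = key[:-1]
    return key


def group_variants(words):
    """Group words sharing a key; keep only groups with more than one member."""
    groups = defaultdict(list)
    for w in words:
        groups[variant_key(w)].append(w)
    return {k: v for k, v in groups.items() if len(v) > 1}


groups = group_variants(
    ["quemado", "quemada", "quemados", "tomate", "tomates", "limón", "limones", "sal"]
)
print(groups)
```

For underscore-joined n-grams such as `pimentón_dulce`, this sketch only normalizes the end of the whole string; splitting on `_` and normalizing each token would be a natural extension.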

The code is written in a modular way, so that different embedding models or techniques can be swapped in easily.  
Initially, I will use a **custom-trained Word2Vec model** built on the Spanish recipe corpus to capture the semantic relationships between ingredients.
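
One possible way to keep the embedding model swappable is to hide it behind a small interface. The sketch below is an assumption about how that could look, not the notebook's actual design: `EmbeddingBackend`, `DictBackend`, `Word2VecBackend`, and `most_similar` are hypothetical names, and the two-dimensional toy vectors are made up purely to exercise the code.

```python
from typing import Iterable, Optional, Protocol

import numpy as np


class EmbeddingBackend(Protocol):
    """Hypothetical interface: map a token to a vector, or None if unknown."""

    def vector(self, token: str) -> Optional[np.ndarray]: ...


class DictBackend:
    """Toy backend over a plain dict; stands in for a real model here."""

    def __init__(self, table):
        self._table = {k: np.asarray(v, dtype=float) for k, v in table.items()}

    def vector(self, token):
        return self._table.get(token)


class Word2VecBackend:
    """Adapter for a loaded gensim Word2Vec model (uses only `model.wv`)."""

    def __init__(self, model):
        self._wv = model.wv

    def vector(self, token):
        return self._wv[token] if token in self._wv else None


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(backend: EmbeddingBackend, query: str,
                 candidates: Iterable[str], top_n: int = 3):
    """Rank known candidates by cosine similarity to the query."""
    q = backend.vector(query)
    if q is None:
        return []
    scored = []
    for c in candidates:
        v = backend.vector(c)
        if v is not None and c != query:
            scored.append((c, cosine(q, v)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up 2-D vectors, for illustration only.
toy = DictBackend({
    "mantequilla": [1.0, 0.1],
    "margarina": [0.9, 0.2],
    "hacer": [-0.2, 1.0],
})
ranked = most_similar(toy, "mantequilla", ["margarina", "hacer", "desconocido"])
print(ranked)
```

With the real model, `Word2VecBackend(embedding_model)` could be passed wherever `toy` is used, and the substitute-finding goal reduces to a call to `most_similar`.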


In [8]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve, roc_auc_score, roc_curve, auc

import gensim

In [None]:
# Load the embedding model (a gensim Word2Vec model saved at 'models/w2v_ngram.model')
embedding_model = gensim.models.Word2Vec.load('models/w2v_ngram.model')

def get_embeddings(words, model=embedding_model):
    """
    Return a DataFrame with one row per in-vocabulary word and one column
    per embedding dimension. Out-of-vocabulary words are skipped.
    """
    # Keep only words the model knows; model.wv[word] raises KeyError otherwise
    embeddings = {word: model.wv[word] for word in words if word in model.wv}
    return pd.DataFrame.from_dict(embeddings, orient='index')


get_embeddings(['harina', 'pimentón_dulce', 'hacer'], embedding_model)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
harina,-1.206671,-1.086648,0.755209,-0.712497,0.547397,-2.350922,0.088066,0.902832,-0.151898,-1.765988,...,1.064667,0.129367,-0.254776,-0.033124,-0.254989,1.365256,-0.434972,-2.14099,-0.111409,0.155156
pimentón_dulce,-1.132072,0.907035,-1.037668,0.622985,0.709571,-0.73563,0.073476,0.711704,-0.511328,-0.733734,...,-0.131478,1.143373,-0.233818,0.052328,0.845094,0.50592,0.249045,-0.754381,-0.148701,0.131099
hacer,1.059608,0.045305,0.101836,-0.86656,0.357926,1.08324,-0.04642,-1.884677,-1.114905,-0.398307,...,0.456596,-1.377939,0.425344,-0.010998,-0.972538,0.401597,0.088806,-0.857304,0.961304,0.278169


### Ingredient Classification: Yes or No?

This section focuses on building a classification model to determine whether a given word or n-gram represents an **ingredient** or **not**.

We will experiment with the following models:

- **Naive Bayes**: Chosen for its simplicity, efficiency, and reasonable performance on text-based data.
- **Logistic Regression**: A strong baseline for binary classification that performs well with high-dimensional data.
- **Random Forest**: Provides good generalization and handles non-linear relationships, helping with noisy or imperfect labels.

### Why we're not using:
- **K-Nearest Neighbors (KNN)**: Prediction is slow on large datasets, and majority-class neighbors tend to dominate when classes are imbalanced.
- **Support Vector Machines (SVM)**: Kernel SVMs scale poorly to large datasets, and a linear SVM may struggle when classes aren't linearly separable.
- **Decision Trees**: Prone to overfitting, especially with noisy labels.

### Considerations:
- Our training dataset was **manually labeled**, so it's crucial the model is **robust to mislabeled examples** or outliers.
- Evaluation will include manual inspection of edge cases and common mistakes.

### Test Cases:
- `"harina de trigo"` → ingredient  
- `"precalentar el horno"` → not an ingredient  
- `"queso rallado"` → ingredient  
- `"batir los huevos"` → not an ingredient  
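
The three candidate models can be compared under the same pipeline. The sketch below is only an illustration under stated assumptions: it uses character n-gram TF-IDF features (the feature choice is my assumption, not something fixed above), a hypothetical helper `build_pipeline`, and a handful of hand-labeled toy words instead of the real training file, so its predictions say nothing about real performance.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training set, labeled by hand for this sketch only.
toy_words = ["harina", "pimentón_dulce", "pimiento_amarillo", "queso",
             "hacer", "mezclamos", "aderezar", "agrégalo"]
toy_labels = [True, True, True, True, False, False, False, False]


def build_pipeline(clf):
    """Hypothetical helper: character n-gram TF-IDF followed by a classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
        ("clf", clf),
    ])


candidates = {
    "naive_bayes": MultinomialNB(),
    "log_reg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

test_cases = ["harina de trigo", "precalentar el horno",
              "queso rallado", "batir los huevos"]

predictions = {}
for name, clf in candidates.items():
    pipe = build_pipeline(clf).fit(toy_words, toy_labels)
    predictions[name] = pipe.predict(test_cases).tolist()

for name, preds in predictions.items():
    print(name, dict(zip(test_cases, preds)))
```

Because every model sits behind the same `Pipeline` interface, the same structure could later be handed to `GridSearchCV` for tuning on the real labeled data.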

In [21]:
data = pd.read_csv("data/is_ingrediente.csv").rename(columns={'Unnamed: 0': 'word'})
data.sample(6)

Unnamed: 0,word,is_ingrediente
4461,pimiento_amarillo,True
447,elaboración,False
404,mezclamos,False
2722,aderezar,False
4778,adquieran,False
5864,agrégalo,False
