In [20]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
# import dataset
data = pd.read_csv('recipe_final (1).csv')
# See Head of dataset
data.head()

Unnamed: 0.1,Unnamed: 0,recipe_id,recipe_name,aver_rate,image_url,review_nums,calories,fat,carbohydrates,protein,cholesterol,sodium,fiber,ingredients_list
0,0,222388,Homemade Bacon,5.0,https://images.media-allrecipes.com/userphotos...,3,15,36,1,42,21,81,2,"['pork belly', 'smoked paprika', 'kosher salt'..."
1,1,240488,"Pork Loin, Apples, and Sauerkraut",4.76,https://images.media-allrecipes.com/userphotos...,29,19,18,10,73,33,104,41,"['sauerkraut drained', 'Granny Smith apples sl..."
2,2,218939,Foolproof Rosemary Chicken Wings,4.57,https://images.media-allrecipes.com/userphotos...,12,17,36,2,48,24,31,4,"['chicken wings', 'sprigs rosemary', 'head gar..."
3,3,87211,Chicken Pesto Paninis,4.62,https://images.media-allrecipes.com/userphotos...,163,32,45,20,65,20,43,18,"['focaccia bread quartered', 'prepared basil p..."
4,4,245714,Potato Bacon Pizza,4.5,https://images.media-allrecipes.com/userphotos...,2,8,12,5,14,7,8,3,"['red potatoes', 'strips bacon', 'Sauce:', 'he..."


In [5]:
# check for missing values
data.isnull().sum()

Unnamed: 0          0
recipe_id           0
recipe_name         0
aver_rate           0
image_url           0
review_nums         0
calories            0
fat                 0
carbohydrates       0
protein             0
cholesterol         0
sodium              0
fiber               0
ingredients_list    0
dtype: int64

In [6]:
data['ingredients_list'][0]

"['pork belly', 'smoked paprika', 'kosher salt', 'ground black pepper']"

# KNN  
**Type:** Supervised learning algorithm.  
**Used for:** Classification and regression.  
**Basic idea:** Predict the class or value of a data point from the 'k' most similar data points  
in the training set.
## How KNN Works:
1. **Choose the number of neighbors (k):** select how many neighbors to consider when making the prediction.
2. **Calculate the distance:** compute the distance between the new data point and every training data point. Common distance metrics include *Euclidean, Manhattan, and Minkowski distance*.
3. **Find the nearest neighbors:** identify the 'k' training data points that are closest to the new data point based on the calculated distances.
4. **Make the prediction, classification, or recommendation** from those neighbors.
## In our case
We select 3 neighbors. Because we only need the most similar recipes and not a predicted label, we use scikit-learn's `NearestNeighbors`, which performs the neighbor search (steps 1 to 3) without a target column.
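
The steps above can be sketched in plain Python. This is only a minimal illustration on made-up 2-D points, not the model used in this notebook; `k_nearest`, `train_points`, and `query` are hypothetical names introduced here.

```python
import math

def k_nearest(train_points, query, k=3):
    """Return the indices of the k training points closest to query."""
    # Step 2: Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Step 3: sort the indices by distance and keep the k smallest
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    return order[:k]

# Toy data, made up for illustration
train_points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0), (0.5, 0.0)]
query = (0.4, 0.1)

neighbors = k_nearest(train_points, query, k=3)  # Step 1: k = 3
print(neighbors)  # [4, 0, 1]
```

For a recommendation task, step 4 simply returns the rows at these indices.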

In [9]:
# preprocess ingredients
# Convert the text in the ingredients_list column into TF-IDF feature vectors
vectorizer = TfidfVectorizer()
X_ingredients = vectorizer.fit_transform(data['ingredients_list'])

# Understanding TfidfVectorizer with Example
The TfidfVectorizer is a feature extraction tool in NLP that converts text data into numerical feature vectors using the term frequency-inverse document frequency (TF-IDF) method.

Example:  
Consider the following documents:
| Document | Text |  
| --- | --- |  
| Doc 1 | I love dogs |  
| Doc 2 | I hate dogs and knitting |  
| Doc 3 | Knitting is my hobby and passion |
      
Step-by-step transformation:    
1. **Tokenization**: Split each document into individual words. 
2. **Term Frequency (TF)**: Calculate the term frequency of each word in each document.
3. **Inverse Document Frequency (IDF)**: Calculate how rare each word is across all documents.
4. **TF-IDF**: Multiply TF by IDF to get the weight of each word in each document.

Key Concepts:
1. **Term Frequency (TF)**:  
Measure of how frequently a term occurs in a document.
    - TF = (Number of times term t appears in a document) / (Total number of terms in the document)
    - Example: in Doc 1, the word "I" appears 1 time and the document has 3 terms, so TF = 1/3 = 0.33
2. **Inverse Document Frequency (IDF)**:
Measure of how rare, and therefore how informative, a term is across the documents.
    - IDF = log(N/n), where N is the total number of documents and n is the number of documents in which term t appears. The example uses the base-10 logarithm.
    - Example: we have 3 documents and the word "I" appears in 2 of them, so IDF = log(3/2) = 0.176
3. **TF-IDF**:
Product of TF and IDF.
    - TF-IDF = TF * IDF
    - Example: TF-IDF for the word "I" in Doc 1 = 0.33 * 0.176 = 0.058

In the same way, we can calculate TF-IDF for every word in every document, which gives each word a numerical weight like the 0.058 computed for "I" above. scikit-learn's `TfidfVectorizer` follows the same idea but uses a smoothed natural-log IDF and L2-normalizes each document vector, so its actual values differ from this hand calculation.
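
The hand calculation can be checked with a few lines of plain Python. This is a sketch of the textbook formulas only, under the assumptions that tokens are lowercased and split on whitespace and that the logarithm is base 10; the helper names `tf`, `idf`, and `tf_idf` are introduced here for illustration. It does not call `TfidfVectorizer`, whose default token pattern would also ignore single-character tokens such as "I".

```python
import math

docs = {
    "Doc 1": "I love dogs",
    "Doc 2": "I hate dogs and knitting",
    "Doc 3": "Knitting is my hobby and passion",
}

# Tokenization: lowercase each document and split it on whitespace
tokens = {name: text.lower().split() for name, text in docs.items()}

def tf(term, doc_tokens):
    """Share of the document's tokens that are equal to term."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_tokens):
    """Base-10 log of (number of documents / documents containing term)."""
    n_containing = sum(term in doc_tokens for doc_tokens in all_tokens.values())
    return math.log10(len(all_tokens) / n_containing)

def tf_idf(term, doc_name):
    return tf(term, tokens[doc_name]) * idf(term, tokens)

print(round(tf("i", tokens["Doc 1"]), 2))  # 0.33
print(round(idf("i", tokens), 3))          # 0.176
print(round(tf_idf("i", "Doc 1"), 4))      # 0.0587
```

The last value is 0.0587 rather than 0.058 because the code multiplies the unrounded factors, while the hand calculation multiplies 0.33 by 0.176.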

In [13]:
# Standardize the numerical features (zero mean, unit variance)
scaler = StandardScaler()
X_numerical = scaler.fit_transform(data[['calories', 'fat', 'carbohydrates', 'protein', 'cholesterol','sodium','fiber']])

# Standard Scaler
StandardScaler standardizes features by removing the mean and scaling to unit variance, so that features measured on very different scales become comparable. This step is often loosely called data normalization and is generally performed during data preprocessing.
## Key Concepts:
1. **Mean**: The average value of a feature.
2. **Standard Deviation**: A measure of the amount of variation or dispersion of a feature.
3. **Z-Score Formula**: z = (x - u) / s, where x is the feature value, u is the mean, and s is the standard deviation.
## Example:
Consider the following data:
| Sample | Feature 1 | Feature 2 |
| --- | --- | --- |
| A | 1 | 200 |
| B | 2 | 300 |
| C | 3 | 400 |
| D | 4 | 500 |
| E | 5 | 600 |

Steps:
1. **Calculate Mean and Standard Deviation**:  

| Feature | Mean | Standard Deviation |  
| --- | --- | --- |  
| Feature 1 | 3 | 1.5811 |  
| Feature 2 | 400 | 158.11 |  
  
2. **Standardize the Data**:  

| Sample | Feature 1 (orig) | Feature 2 (orig) | Feature 1 (std) | Feature 2 (std) |  
| --- | --- | --- | --- | --- |  
| A | 1 | 200 | (1 - 3)/1.5811 = -1.265 | (200 - 400)/158.11 = -1.265 |  
| B | 2 | 300 | (2 - 3)/1.5811 = -0.632 | (300 - 400)/158.11 = -0.632 |  
| C | 3 | 400 | (3 - 3)/1.5811 = 0 | (400 - 400)/158.11 = 0 |  
| D | 4 | 500 | (4 - 3)/1.5811 = 0.632 | (500 - 400)/158.11 = 0.632 |  
| E | 5 | 600 | (5 - 3)/1.5811 = 1.265 | (600 - 400)/158.11 = 1.265 |  

These values use the sample standard deviation (dividing by n - 1). scikit-learn's `StandardScaler` divides by n instead, which gives a standard deviation of 1.4142 for Feature 1 and a z-score of -1.414 for sample A. The idea is the same.

## Conclusion
After standardization, no feature dominates the model just because its values vary over a larger range; every feature is given comparable importance.
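
The worked example can be reproduced with the standard library. This is a minimal sketch, not the `StandardScaler` implementation; `z_scores` and its `sample` flag are hypothetical names introduced here, and results are rounded to three decimals. With `sample=True` it divides by the sample standard deviation (n - 1), and with `sample=False` by the population standard deviation (n), which is the convention `StandardScaler` uses.

```python
import statistics

feature_1 = [1, 2, 3, 4, 5]
feature_2 = [200, 300, 400, 500, 600]

def z_scores(values, sample=True):
    """Standardize values with z = (x - mean) / std, rounded to 3 decimals."""
    mean = statistics.mean(values)
    # stdev divides by n - 1, pstdev divides by n
    std = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [round((x - mean) / std, 3) for x in values]

print(z_scores(feature_1))                # [-1.265, -0.632, 0.0, 0.632, 1.265]
print(z_scores(feature_2))                # [-1.265, -0.632, 0.0, 0.632, 1.265]
print(z_scores(feature_1, sample=False))  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

Both features end up with identical standardized values even though their original ranges differ by a factor of 100, which is the point of the scaling step.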

In [14]:
X_numerical

array([[-0.1317045 ,  0.46001924, -1.15482863, ...,  0.04256474,
         1.13990476, -0.76393724],
       [ 0.23857551, -0.33625589, -0.01920347, ...,  0.52871248,
         1.59202345,  2.53220175],
       [ 0.05343551,  0.46001924, -1.02864806, ...,  0.16410167,
         0.15703804, -0.59490447],
       ...,
       [-0.77969453, -0.82286847, -0.27156462, ..., -0.68665688,
        -0.39336732, -0.25683894],
       [ 0.33114552,  0.32730672,  0.73787996, ...,  0.20461398,
        -0.31473798, -0.51038809],
       [-1.33511455, -0.9998185 , -1.02864806, ..., -0.80819381,
        -0.13782197, -0.42587171]])

In [17]:
# Now combine X_ingredients and X_numerical
X_combined = np.hstack([X_numerical,X_ingredients.toarray()])
X_combined

array([[-0.1317045 ,  0.46001924, -1.15482863, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.23857551, -0.33625589, -0.01920347, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.05343551,  0.46001924, -1.02864806, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.77969453, -0.82286847, -0.27156462, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.33114552,  0.32730672,  0.73787996, ...,  0.        ,
         0.        ,  0.        ],
       [-1.33511455, -0.9998185 , -1.02864806, ...,  0.        ,
         0.        ,  0.        ]])

In [21]:
# Train KNN Model
knn = NearestNeighbors(n_neighbors=3, metric='euclidean')
knn.fit(X_combined)

We could also use cosine similarity, but it is mostly used when dealing with textual data. We use KNN with Euclidean distance here because our dataset consists mostly of numerical features.
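
To make the difference between the two measures concrete, here is a small sketch in plain Python on two made-up vectors; the function names are introduced here for illustration. Euclidean distance reacts to the magnitude of the vectors, while cosine distance (1 minus cosine similarity) only looks at their direction. If cosine distance were preferred, `NearestNeighbors` also accepts `metric='cosine'`.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.dist(a, b)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(round(euclidean_distance(a, b), 3))                    # 3.742
print(math.isclose(cosine_distance(a, b), 0.0, abs_tol=1e-9))  # True
```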

In [24]:
def recommend_recipe(input_features):
    """Return the nearest recipes for 7 nutrition values followed by an ingredients string."""
    # scale the 7 numerical features with the fitted scaler
    input_features_scaled = scaler.transform([input_features[:7]])
    # vectorize the ingredients string with the fitted TF-IDF vectorizer
    input_ingredients_transformed = vectorizer.transform([input_features[7]])
    # combine both parts in the same column order used for training
    input_combined = np.hstack([input_features_scaled, input_ingredients_transformed.toarray()])
    # get the distances to and indices of the nearest neighbors
    distances, indices = knn.kneighbors(input_combined)
    recommendation = data.iloc[indices[0]]  # look up the recommended rows
    return recommendation[['recipe_name', 'ingredients_list', 'image_url']]

In [25]:
input_features = [100, 10, 20, 30, 40, 50, 60, 'chicken, rice, salt, pepper, onion, garlic']
recommendation = recommend_recipe(input_features)
recommendation



Unnamed: 0,recipe_name,ingredients_list,image_url
4064,Beefy Broccoli & Cheddar Burritos,"['ground beef', 'Knorr® Rice Sides™', 'water',...",https://images.media-allrecipes.com/userphotos...
12024,"Cumin Lamb Steaks with Smashed Potatoes, Wilte...","['new potatoes', 'butter', 'garlic', 'brown su...",https://images.media-allrecipes.com/userphotos...
12803,Bourbon Street New York Strip Steak,"['boneless New York strip steaks', 'bourbon wh...",https://images.media-allrecipes.com/userphotos...
