
# 1) Author

**Student Name**:  Abhishek Anand

**Student ID**: 230773516

# 2) Problem formulation

We need to build a machine learning pipeline that takes as an input a list of ingredients and predicts whether the dish will be healthy or unhealthy.

With the problem defined, we need the data. We first install the mlend library, which provides access to the yummy dataset of dish images and their attributes.

In [None]:
!pip install mlend

Collecting mlend
  Downloading mlend-1.0.0.3-py3-none-any.whl (10 kB)
Collecting spkit>0.0.9.5 (from mlend)
  Downloading spkit-0.0.9.6.7-py3-none-any.whl (1.6 MB)
Collecting python-picard (from spkit>0.0.9.5->mlend)
  Downloading python_picard-0.7-py3-none-any.whl (16 kB)
Collecting pylfsr (from spkit>0.0.9.5->mlend)
  Downloading pylfsr-1.0.7-py3-none-any.whl (28 kB)
Collecting phyaat (from spkit>0.0.9.5->mlend)
  Downloading phyaat-0.0.3-py3-none-any.whl (27 kB)
Installing collected packages: python-picard, pylfsr, phyaat, spkit, mlend
Successfully installed mlend-1.0.0.3 phyaat-0.0.3 pylfsr-1.0.7 python-picard-0.7 spkit-0.0.9.6.7


In [None]:
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spkit as sp

from skimage import exposure
from skimage.color import rgb2hsv, rgb2gray
import skimage as ski

import mlend
from mlend import download_yummy_small, yummy_small_load, download_yummy, yummy_load

import os, sys, re, pickle, glob
import urllib.request
import zipfile

import IPython.display as ipd
from tqdm import tqdm
import librosa

drive.mount('/content/drive')

Mounted at /content/drive


# 3) Machine Learning pipeline

The pipeline contains the following stages:

Input data: Ingredient lists read from the dataset's attributes CSV on the mounted drive

Data Preprocessing: Cleaning and labelling the data

Feature extraction: Extracting significant features from the data

Modelling: Applying the machine learning model

Evaluation: Testing the model performance

# 4) Dataset


The full dataset can be downloaded with the download_yummy function from the MLEnd library. Here we read the CSV file that describes each image from the mounted drive.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Data/MLEnd/yummy/MLEndYD_image_attributes_benchmark.csv').set_index('filename')
df.head()

| filename | Diet | Cuisine_org | Cuisine | Dish_name | Home_or_restaurant | Ingredients | Healthiness_rating | Healthiness_rating_int | Likeness | Likeness_int | Benchmark_A |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 000001.jpg | non_vegetarian | japanese | japanese | chicken_katsu_rice | marugame_udon | rice,chicken_breast,spicy_curry_sauce | neutral | 3.0 | like | 4.0 | Train |
| 000002.jpg | non_vegetarian | english | english | english_breakfast | home | eggs,bacon,hash_brown,tomato,bread,tomato,bake... | unhealthy | 2.0 | like | 4.0 | Train |
| 000003.jpg | non_vegetarian | chinese | chinese | spicy_chicken | jinli_flagship_branch | chili,chicken,peanuts,sihuan_peppercorns,green... | neutral | 3.0 | strongly_like | 5.0 | Train |
| 000004.jpg | vegetarian | indian | indian | gulab_jamun | home | sugar,water,khoya,milk,salt,oil,cardamon,ghee | unhealthy | 2.0 | strongly_like | 5.0 | Train |
| 000005.jpg | non_vegetarian | indian | indian | chicken_masala | home | chicken,lemon,turmeric,garam_masala,coriander_... | healthy | 4.0 | strongly_like | 5.0 | Train |


In [None]:
df.isna().sum()

Diet                      0
Cuisine_org               5
Cuisine                   5
Dish_name                 0
Home_or_restaurant        0
Ingredients               0
Healthiness_rating        1
Healthiness_rating_int    1
Likeness                  4
Likeness_int              4
Benchmark_A               0
dtype: int64

Some columns have missing values, but we are only concerned with the columns 'Ingredients' and 'Healthiness_rating_int'.

'Healthiness_rating_int' has only 1 missing value, so we can drop that row:

In [None]:
new_df = df[['Ingredients','Healthiness_rating_int']]
print(new_df.shape)

fil_df = new_df.dropna(axis=0)

print(fil_df.shape)


(3250, 2)
(3249, 2)


In [None]:
fil_df.isna().sum()

Ingredients               0
Healthiness_rating_int    0
dtype: int64

Next, we label the data. We use the label '1' for healthy and '0' for unhealthy (both stored as strings). Labels are derived from the column 'Healthiness_rating_int', which takes values from 1 to 5: a rating of 3 or more is labelled '1' (healthy), and anything lower is labelled '0' (unhealthy). Dishes rated neutral (3) therefore count as healthy.

In [None]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
fil_df = fil_df.assign(label=np.where(fil_df['Healthiness_rating_int'] >= 3, '1', '0'))
fil_df.shape


(3249, 3)
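
The thresholding rule is easy to check on a handful of values. The sketch below uses a made-up toy column (not rows from the dataset) to show which ratings map to which label, and counts the resulting classes as a quick check for imbalance.

```python
import numpy as np
import pandas as pd

# Toy ratings standing in for 'Healthiness_rating_int' (illustrative values, not dataset rows).
toy = pd.DataFrame({"Healthiness_rating_int": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Same rule as in the notebook: ratings of 3 or more become '1' (healthy), the rest '0'.
toy = toy.assign(label=np.where(toy["Healthiness_rating_int"] >= 3, "1", "0"))

# Count the rows per class to see how balanced the labels are.
label_counts = toy["label"].value_counts()
print(toy)
print(label_counts)
```

Running the same `value_counts()` call on the real label column would show how many dishes fall into each class.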

In [None]:
final_df = fil_df[['Ingredients','label']]
final_df

| filename | Ingredients | label |
|---|---|---|
| 000001.jpg | rice,chicken_breast,spicy_curry_sauce | 1 |
| 000002.jpg | eggs,bacon,hash_brown,tomato,bread,tomato,bake... | 0 |
| 000003.jpg | chili,chicken,peanuts,sihuan_peppercorns,green... | 1 |
| 000004.jpg | sugar,water,khoya,milk,salt,oil,cardamon,ghee | 0 |
| 000005.jpg | chicken,lemon,turmeric,garam_masala,coriander_... | 1 |
| ... | ... | ... |
| 003246.jpg | 1_cup_basmati_rice,2_cups_water,2_tablespoons_... | 1 |
| 003247.jpg | fried_cottage_cheese,ghee,lentils,milk,wheat_f... | 1 |
| 003248.jpg | potato,onion,peanut,salt,turmeric_powder,red_c... | 0 |
| 003249.jpg | kiwi,banana,apple,milk | 1 |


In [None]:
X = final_df['Ingredients']
y = final_df['label']


We have separated the ingredients and the labels. Now let us split the dataset into training and test sets using the train_test_split function from sklearn. The training set contains 70% of the dataset and the test set the remaining 30%.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
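
If the two classes are unevenly represented, a stratified split is one possible variation. The sketch below is not what this notebook does (the reported results come from the unstratified split above); it uses made-up toy data to show how the `stratify` argument of `train_test_split` keeps class proportions equal in both partitions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data with an 80/20 class imbalance (made up for illustration).
toy_X = [f"dish_{i}" for i in range(10)]
toy_y = ["1"] * 8 + ["0"] * 2

# stratify=toy_y asks for the same class proportions in the train and test partitions.
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=42, stratify=toy_y
)

train_counts = Counter(ytr)
test_counts = Counter(yte)
print(train_counts, test_counts)
```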

# 5) Transformation stage

We use TF-IDF for feature extraction. Term Frequency-Inverse Document Frequency is a technique for converting textual data into numerical vectors, making it suitable for machine learning algorithms.

Term Frequency (TF) measures how often a term (word) appears in a document relative to the total number of terms in that document. Inverse Document Frequency (IDF) measures the importance of a term across a collection of documents. It helps to identify terms that are rare or unique and can distinguish documents from each other.

TF-IDF Vectorization combines the TF and IDF values to create a numerical representation of each document in a corpus. The vector for a document is a set of TF-IDF values for each term in the vocabulary.
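
To make the combination of TF and IDF concrete, the sketch below recomputes the weights of one document by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings, where idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document vector is then scaled to unit L2 length. The three ingredient strings are toy examples, not rows from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy "documents" written as comma-separated ingredient lists.
docs = ["rice,chicken,salt", "rice,milk,sugar", "kiwi,banana,milk"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_  # maps each term to its column index

n_docs = len(docs)
tokens = [d.split(",") for d in docs]

def idf(term):
    # Smoothed inverse document frequency (scikit-learn's default formula).
    df = sum(term in t for t in tokens)
    return np.log((1 + n_docs) / (1 + df)) + 1

# Raw tf * idf for the first document, then L2 normalisation.
first = tokens[0]
raw = {t: first.count(t) * idf(t) for t in set(first)}
norm = np.sqrt(sum(v ** 2 for v in raw.values()))
manual = {t: v / norm for t, v in raw.items()}

for term, weight in sorted(manual.items()):
    print(term, round(weight, 4), round(X[0, vocab[term]], 4))
```

In this toy corpus 'chicken' appears in one document and 'rice' in two, so 'chicken' receives the larger weight within the first document.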

In [None]:
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)


In the code, fit_transform is used on the training set to learn the vocabulary and compute the TF-IDF values for each term in each document. The same vectorizer is then used to transform the test set, ensuring consistent representation based on the learned vocabulary.

The resulting X_train_tfidf and X_test_tfidf matrices can be used as input features for machine learning models. Each row represents a document, and each column represents a term with its associated TF-IDF value. This approach allows the model to understand the importance of different terms in each document.

# 6) Modelling

We will use the logistic regression algorithm, which is commonly used for binary classification problems (two classes).

In [None]:
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)



In [None]:
predictions = model.predict(X_test_tfidf)
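
As an alternative arrangement, the vectorizer and the classifier can be chained into a single scikit-learn pipeline, so that raw ingredient strings go in and labels come out. This is a sketch on made-up toy data, not the notebook's actual training run; the names `pipe`, `toy_X`, `toy_y`, and `new_dish` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy ingredient lists with made-up labels ('1' = healthy, '0' = unhealthy).
toy_X = [
    "kiwi,banana,apple",
    "apple,milk,banana",
    "kiwi,apple",
    "sugar,oil,ghee",
    "bacon,oil,sugar",
    "sugar,ghee",
]
toy_y = ["1", "1", "1", "0", "0", "0"]

# The pipeline fits the vectorizer and the classifier together on the training strings.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(toy_X, toy_y)

# At prediction time the same fitted vectorizer is applied automatically.
new_dish = ["kiwi,banana"]
pred = pipe.predict(new_dish)
proba = pipe.predict_proba(new_dish)  # columns follow pipe.classes_
print(pred, proba)
```

Bundling both steps this way makes it harder to accidentally fit the vectorizer on test data.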

# 7) Results

In [None]:
ytp = model.predict(X_train_tfidf)
ysp = model.predict(X_test_tfidf)

train_accuracy = np.mean(ytp==y_train)
test_accuracy  = np.mean(ysp==y_test)

print('Training Accuracy:\t',train_accuracy)
print('Test  Accuracy:\t',test_accuracy)

Training Accuracy:	 0.8540017590149517
Test  Accuracy:	 0.8246153846153846
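
Accuracy on its own can look high when one class dominates, so it helps to compare it with a majority-class baseline and to look at the confusion matrix. The sketch below uses made-up label arrays (not this model's predictions) to show the comparison; on the real data, `y_test` and `ysp` would take the place of `y_true` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for ten dishes (illustrative only).
y_true = np.array(["1", "1", "1", "1", "0", "0", "1", "1", "0", "1"])
y_pred = np.array(["1", "1", "1", "1", "1", "0", "1", "1", "1", "1"])

acc = accuracy_score(y_true, y_pred)

# Baseline: accuracy obtained by always predicting the most frequent class.
values, counts = np.unique(y_true, return_counts=True)
baseline = counts.max() / counts.sum()

# Rows are true classes, columns are predicted classes, in the order ['0', '1'].
cm = confusion_matrix(y_true, y_pred, labels=["0", "1"])

print("accuracy:", acc)
print("majority baseline:", baseline)
print(cm)
```

In this toy case the classifier beats the baseline by only ten percentage points and misses two of the three unhealthy dishes, which the accuracy figure alone would hide.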


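# 8) Conclusions

The model reaches a training accuracy of about 85.4% and a test accuracy of about 82.5%. The gap of roughly 3 percentage points is small, which suggests the model generalises to unseen dishes without severe overfitting.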