# Recommending a Recipe Based on Food Image Predictions

### Overview
The code notebooks contain the following 5 sections:
1. Introduction
2. Data Collection & Cleaning
3. Pre-Processing & Modeling
4. EDA & Data Visualization
5. Conclusion

### Problem Statement
Simply put, I intend to take an image of food, predict what it is, and then recommend a recipe that details the ingredients and steps for preparing the dish.

In this project, I hope to accomplish the following objectives:
- Collect image data for five food classes using the pushshift.io Reddit API,
- Build a convolutional neural network to process image data,
- Make predictions on a multiclass classification problem with accuracy, recall, and precision scores higher than those of the null model, and
- Recommend a recipe based on the identified food class so that the user can understand the basic ingredients and process of food preparation required to make the dish.
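The null model referenced in the objectives always predicts the most frequent class, so its accuracy sets the floor the CNN has to beat. The snippet below is a minimal sketch of that calculation, assuming a balanced dataset of 100 images per class; the function name `null_model_accuracy` and the label list are illustrative and not taken from the project notebooks.

```python
from collections import Counter


def null_model_accuracy(labels):
    """Return the accuracy of always predicting the most frequent class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # most_common(1) returns a list holding one (label, count) pair.
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Assumed balanced dataset: 100 images for each of the five food classes.
classes = ["hamburger", "hot dog", "pizza", "taco", "sushi"]
labels = [name for name in classes for _ in range(100)]

baseline = null_model_accuracy(labels)
print(f"Null-model accuracy: {baseline:.2f}")
```

With five equally sized classes the baseline works out to 1/5, so any useful model should score above 20% accuracy.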

For this multiclass classification problem, I aim to predict the following 5 dishes:
1. Hamburger
2. Hot dog
3. Pizza
4. Taco
5. Sushi

I will be using the following subreddits to collect image data:
- r/burgers (103k members)
- r/hotdogs (19.5k members)
- r/Pizza (309k members)
- r/tacos (47.7k members)
- r/sushi (226k members)

Two major risks that I had assumed would not cause problems ended up materializing.

The first issue involves data collection. Originally, I had planned to use a web API to scrape image data from Google Images. Because the legality of that approach is uncertain, I opted for a less efficient method of collecting image data (i.e. the pushshift.io Reddit API). Legality aside, I also quickly ran into funding constraints and limits on the number of API calls per month, since most Google Image search APIs and proxies charge monthly fees that vary by plan [(source)](https://www.scraperapi.com/blog/best-google-image-search-apis-and-proxies/).

The second issue involves the sample size. My goal was to build a CNN using a minimum of 1,000 images per class. A sample of that size not only took up enough storage on my local device to slow processing, but also made it too difficult to validate that every image was correctly labeled. Because the subreddits sometimes include posts that are not actual images of food, each image requires manual verification, so I reduced the sample size to 100 images per class, which is less than ideal.

**Note:** At full scale, this project would function like a reverse Google Images. More specifically, the user would provide the model with an image of ANY food in existence, and the model would accurately predict the name of the food and provide a recipe that details the ingredients and steps for preparing the food. This would be extremely helpful for anyone trying a new dish or cuisine or people who have specific dietary restrictions.

### Data Sources
Given more resources, a better way to collect data would have been through a web API for Google Images. Unfortunately, this would have required securing a bigger budget and navigating legal issues (which I am not equipped to do). Thus, for practical reasons, I chose to collect the image data using the pushshift.io Reddit API. Links to the data are provided below:
- [r/burgers](https://api.pushshift.io/reddit/search/submission?subreddit=burgers): The pushshift.io Reddit API for r/burgers.
- [r/hotdogs](https://api.pushshift.io/reddit/search/submission?subreddit=hotdogs): The pushshift.io Reddit API for r/hotdogs.
- [r/Pizza](https://api.pushshift.io/reddit/search/submission?subreddit=Pizza): The pushshift.io Reddit API for r/Pizza.
- [r/tacos](https://api.pushshift.io/reddit/search/submission?subreddit=tacos): The pushshift.io Reddit API for r/tacos.
- [r/sushi](https://api.pushshift.io/reddit/search/submission?subreddit=sushi): The pushshift.io Reddit API for r/sushi.
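The endpoints above all follow the same pattern, so the request URLs can be built programmatically and the returned submissions filtered down to direct image links. The snippet below is a rough sketch under several assumptions: the `size` query parameter, the `url` field on each submission, and the helper names `build_request_url` and `extract_image_urls` are illustrative and are not confirmed by the project notebooks.

```python
from urllib.parse import urlencode, urlparse

BASE_URL = "https://api.pushshift.io/reddit/search/submission"
# Assumed set of file extensions that identify a direct image link.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def build_request_url(subreddit, size=100):
    """Build a submission-search URL for one subreddit.

    `size` (results per request) is an assumed query parameter.
    """
    query = urlencode({"subreddit": subreddit, "size": size})
    return f"{BASE_URL}?{query}"


def extract_image_urls(submissions):
    """Keep only submissions whose `url` field points to an image file."""
    urls = []
    for post in submissions:
        url = post.get("url", "")
        # Compare against the URL path so query strings do not interfere.
        if urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS):
            urls.append(url)
    return urls


# Hypothetical submissions shaped like entries in the API's JSON response.
sample = [
    {"url": "https://i.redd.it/abc123.jpg"},
    {"url": "https://www.reddit.com/r/sushi/comments/xyz/"},
    {"title": "text post without a url"},
]
print(build_request_url("sushi", size=50))
print(extract_image_urls(sample))
```

Filtering by file extension only removes non-image posts such as text submissions and links to comment threads. It does not confirm that an image actually shows the expected food, so manual verification is still needed.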

### Data Dictionary

Not Available