# Project Title: Geographic Classification of Cuisine Recipes
### Jinhyeun Kim, Youngjo Kim, Jianyuan Zhai
<br>
<br>

# 1. Introduction
## 1-1. The goal of the project

<font size="4.5"> <b>Classify</b> the geographic origins of recipes based on the ingredients used</font>
<br>
<br>

## 1-2. Why is it important?

<font size="4.5">Cuisines differ across countries and are shaped primarily by regional conditions such as local climate, religion, and trade. <b><i>Therefore, classifying cuisines can improve our understanding of each culture and lifestyle</i></b></font>

In [1]:
from IPython.display import Image, HTML, IFrame
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Foodworld.png?raw=true')

# 2. Method

## 2-1. How did we approach the problem?

<font size="4.5">1. <b><i>Get the data from Kaggle</i></b> (use 'pandas' to read the JSON file into a pandas DataFrame)<br><br> 2. Data Preprocessing: <b><i>Construct a binary matrix X (39774 x 6714)</i></b> <br><br>$\;\;\;\;\;\;$2-1) 39774 recipes from 20 countries, with 6714 distinct ingredients in total <br>$\;\;\;\;\;\;$2-2) For recipe $i$, $X_{i,j}$ equals 1 if ingredient $j$ is used and 0 otherwise </font>

<font size="4.5">3. <b><i>Train/Test Split (0.8, 0.2) of the binary matrix X and the label y</i></b><br><br> 4. <b><i>Model Training and Selection</i></b>: Use 'scikit-learn' to train the classifiers<br><br>$\;\;\;\;\;\;$ 1. Perceptron<br>$\;\;\;\;\;\;$ 2. Logistic Regression <br>$\;\;\;\;\;\;$ 3. Linear SVM <br>$\;\;\;\;\;\;$ 4. ANN<br><br> 5. <b><i>Evaluate the models based on ROC Curves</i></b></font>
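
The preprocessing and split steps above can be sketched in plain Python. This is a minimal illustration on a few made-up recipes, not the notebook's code: the toy `recipes` list, the helper names `build_binary_matrix` and `train_test_split_indices`, and the fixed seed are all assumptions. The project itself reads the Kaggle JSON with pandas and trains with scikit-learn.

```python
import random

# Toy stand-in for the Kaggle JSON records (hypothetical data, not from the dataset).
recipes = [
    {"cuisine": "italian", "ingredients": ["tomato", "basil", "olive oil"]},
    {"cuisine": "mexican", "ingredients": ["tomato", "chili", "corn"]},
    {"cuisine": "italian", "ingredients": ["olive oil", "garlic"]},
    {"cuisine": "korean", "ingredients": ["garlic", "chili", "sesame oil"]},
    {"cuisine": "mexican", "ingredients": ["corn", "lime"]},
]

def build_binary_matrix(records):
    """Return (X, y, vocabulary), where X[i][j] is 1 if recipe i uses ingredient j."""
    vocabulary = sorted({ing for r in records for ing in r["ingredients"]})
    column = {ing: j for j, ing in enumerate(vocabulary)}
    X = []
    for r in records:
        row = [0] * len(vocabulary)
        for ing in r["ingredients"]:
            row[column[ing]] = 1
        X.append(row)
    y = [r["cuisine"] for r in records]
    return X, y, vocabulary

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

X, y, vocabulary = build_binary_matrix(recipes)
train_idx, test_idx = train_test_split_indices(len(X))
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]
print(len(vocabulary), len(X_train), len(X_test))
```

On the real data the same construction would give the 39774 x 6714 matrix described above. A sparse representation is the usual choice at that size, since each recipe uses only a handful of ingredients.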

In [2]:
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/GeoChart(Country).html', width=1000, height=650)

In [3]:
img = 'https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/MainWorkflow.png?raw=true'
Image(url=img)

## 2-2. Challenges encountered

<font size="4.5">1. Class imbalance in the sample affects statistical estimation and can bias the classifier<br>2. The matrix is large, so training is slow (the ANN took over 2 hours on the entire dataset)<br><br>
<b><i>How can we reduce the sample imbalance problem?</i></b></font>

## 2-3. New in our approach

<font size="4.5">Take a Two-step Approach:<br><br><b>What if we classify first by continent and then by country?</b><br>We assume that the two-step approach does not introduce classification bias, because food varies by region (food and region are highly correlated)</font>

In [4]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Dendogram(country).PNG?raw=true')

In [5]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Dendogram(continent).PNG?raw=true')

In [6]:
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/GeoChart(Continent).html', width=1000, height=700)

### Through the two-step approach, we can reduce the sample imbalance problem

In [7]:
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/PieChart(Country).html', width=800, height=400)

In [8]:
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/PieChart(Continent).html', width=800, height=400)

### There is still a high imbalance for Africa, since the Africa class contains only 821 recipes (all from Morocco)
<font size="4.5">We tried setting class_weight to 'balanced' (putting less weight on majority-class instances)<br> $\;\;$→ This gives slightly (~0.1%) higher training accuracy but lower testing accuracy than the unweighted logistic regression</font>
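
For reference, scikit-learn's `class_weight='balanced'` option weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below reimplements that formula in plain Python to show how strongly a rare class is up-weighted. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration. The last lines apply the formula to the figures quoted in this notebook (821 African recipes out of 39774, four continents).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight for class c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Toy labels (hypothetical): one rare class next to two common ones.
toy_labels = ["asia"] * 6 + ["europe"] * 5 + ["africa"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)

# Same formula with the figures quoted in the text:
# 821 African recipes out of 39774 in total, spread over 4 continents.
africa_weight = 39774 / (4 * 821)
print(round(africa_weight, 2))
```

A weight this large means every error on an African recipe counts roughly twelve times as much as an error on an average recipe, which is one plausible reason the weighted model fits the training set slightly better but generalizes slightly worse.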

In [9]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Logistic(Weight_balanced_compare).png?raw=true')

## Two-Step Approach Workflow

In [10]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Workflow3.png?raw=true')

## The goal is to correctly classify each recipe by its geographic origin
<font size="4.5">For example, our model should classify spaghetti as Italian</font>

In [11]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/Workflow4.png?raw=true')

### We determine one best classifier at the continent level and three best classifiers at the country level, one for each continent (America, Asia, and Europe)
$\;\;\;\;\;\;$<font size="4.5">1. Use the ROC curves to determine the best model at each level<br>
$\;\;\;\;\;\;$2. Combine the best model at the continent level with the best models at the country level </font>
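
The chaining of the two levels can be sketched as follows. This is a structural illustration only: `KeywordClassifier` is a hypothetical rule-based stand-in for a trained model, and the rules, labels, and the `SINGLE_CUISINE` mapping are assumptions. Only the control flow (continent first, then that continent's country model, with no second step for Africa because it contains a single cuisine) reflects the workflow described here.

```python
class KeywordClassifier:
    """Hypothetical stand-in for a trained model: predicts from ingredient keywords."""

    def __init__(self, rules, default):
        self.rules = rules      # maps an ingredient to a label
        self.default = default  # label returned when no rule matches

    def predict(self, ingredients):
        for ingredient in ingredients:
            if ingredient in self.rules:
                return self.rules[ingredient]
        return self.default

# Africa holds only Moroccan recipes, so it needs no country-level model.
SINGLE_CUISINE = {"africa": "moroccan"}

def two_step_predict(ingredients, continent_model, country_models):
    """Predict the continent first, then apply that continent's country model."""
    continent = continent_model.predict(ingredients)
    country_model = country_models.get(continent)
    if country_model is None:
        return continent, SINGLE_CUISINE[continent]
    return continent, country_model.predict(ingredients)

continent_model = KeywordClassifier(
    {"soy sauce": "asia", "basil": "europe", "corn": "america", "ras el hanout": "africa"},
    default="europe",
)
country_models = {
    "europe": KeywordClassifier({"basil": "italian", "butter": "french"}, default="italian"),
    "asia": KeywordClassifier({"soy sauce": "chinese", "fish sauce": "thai"}, default="chinese"),
    "america": KeywordClassifier({"corn": "mexican"}, default="southern_us"),
}

print(two_step_predict(["spaghetti", "basil", "tomato"], continent_model, country_models))
print(two_step_predict(["ras el hanout", "lamb"], continent_model, country_models))
```

One consequence of this design is that a continent-level mistake can never be corrected at the country level, so the continent classifier bounds the overall accuracy.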

## 2-4. Evaluation of the Model

## ROC Curves
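
As a reminder of what the area under an ROC curve measures, the sketch below computes a one-vs-rest AUC per class in plain Python. It uses the standard equivalence between AUC and the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties count half). The function names and the toy scores are assumptions for illustration, not the notebook's evaluation code.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def one_vs_rest_auc(y_true, class_scores):
    """Binarize the labels once per class and return {class: AUC}."""
    return {
        c: binary_auc([1 if t == c else 0 for t in y_true], scores)
        for c, scores in class_scores.items()
    }

# Toy example (hypothetical scores for four recipes and two classes).
y_true = ["italian", "italian", "french", "french"]
class_scores = {
    "italian": [0.9, 0.4, 0.6, 0.1],
    "french": [0.1, 0.6, 0.4, 0.9],
}
per_class_auc = one_vs_rest_auc(y_true, class_scores)
print(per_class_auc)
```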

In [12]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ROCCurvesContinent.png?raw=true')

In [13]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ROCCurvesCountries.png?raw=true')

### We picked the four best models,
$\;\;\;\;\;\;$<font size="4.5">For Continent level, we picked <b>logistic regression</b>
<br>$\;\;\;\;\;\;$For Country level (America), <b>logistic regression</b>
<br>$\;\;\;\;\;\;$For Country level (Asia), <b>logistic regression</b>
<br>$\;\;\;\;\;\;$For Country level (Europe), <b>SVM</b></font>

### and connected the best models in sequence

In [14]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Accuracycompare1.png?raw=true')

In [15]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/Accuracycompare2.png?raw=true')

## Overall Classifier (Best Classifier 1 + Best Classifier 2)

In [16]:
Image(url='https://github.com/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/figures/ComparisonBetweenSVMand2Step_resize.png?raw=true')

## Performance of the two-step approach versus the normal approach

### Overall accuracy for the testing set
<font size="4.5"> Testing accuracy with SVM: 0.791
<br>Testing accuracy with the two-step approach: 0.847</font>

<font size="4.5">  (1) The two-step approach has better test accuracy <b>overall</b>
<br>(2) Based on the confusion matrix plot, however, we <b>cannot conclude</b> that the two-step approach is better at predicting every cuisine</font>

## Interpretation of the Confusion Matrix
<font size="4.5"> Discovering relationships between cuisines across regions and cultures </font>


In [18]:
IFrame('https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/ConfusionMatrix.html', width=900, height=900)

<font size="3.5"><b>1. America section</b>
- 9.3% of Cajun & Creole foods are predicted as Southern US foods 
- More than 8% of both Cajun & Creole foods and Southern US foods are predicted as French or Italian, plausibly because of long-standing immigration from Europe to America
- Cajun & Creole cuisine comes from Louisiana, a former French colony, which may explain why 4.44% of Cajun & Creole foods are predicted as French

<font size="3.5"><b>2. Asia section</b>
- Foods from many countries are confused with Chinese foods. This is plausible because China has influenced most of these countries for several thousand years
- Vietnamese and Thai foods may have influenced each other because the two countries are geographically close
- Misclassified Filipino foods are more often predicted as Western cuisines, possibly because the Philippines was colonized by Spain and the United States
- Contrary to our assumption, we found no relationship between Vietnamese and French foods, even though Vietnam was colonized by France for several decades.

<font size="3.5"><b>3. Europe section</b>
- Many foods from European countries are predicted as Italian or French, plausibly because European countries have influenced each other for thousands of years
- Many misclassifications as Italian or French may also be caused by the number of samples (Italian has the largest number of samples and French the second largest)
- 3.6% of British foods were predicted as Indian, which may be explained by the many Indian people who have moved to England

<font size="3.5"><b>4. Africa section</b>
- The African data set contains only one country, Morocco.
- Some Moroccan foods were predicted as European foods, possibly because Morocco is very close to Europe.
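
The percentages quoted in this interpretation read as row fractions of the confusion matrix: the share of recipes with a given true cuisine that were predicted as each cuisine. A minimal sketch of that normalization, on made-up labels, is shown here. The function name `row_normalized_confusion` and the toy data are assumptions, and the numbers it prints are not the project's results.

```python
from collections import Counter, defaultdict

def row_normalized_confusion(y_true, y_pred):
    """Return {true_label: {predicted_label: percent of that true label's samples}}."""
    counts = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    table = {}
    for t, row in counts.items():
        total = sum(row.values())
        table[t] = {p: 100.0 * n / total for p, n in row.items()}
    return table

# Toy labels (hypothetical): 10 Cajun & Creole recipes and 5 Southern US recipes.
y_true = ["cajun_creole"] * 10 + ["southern_us"] * 5
y_pred = (
    ["cajun_creole"] * 8 + ["southern_us"] * 1 + ["french"] * 1
    + ["southern_us"] * 4 + ["italian"] * 1
)

table = row_normalized_confusion(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(table["cajun_creole"])
print(accuracy)
```

Reading rows rather than columns matters here: a large class such as Italian can attract many misclassified recipes in absolute terms while still accounting for only a small percentage of each source cuisine.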


# 3. Discussion



### 1. Does the Best Model at the Continent Level + Best Model at the Country Level lead to the Best Model overall?

<font size="4.5"><b>No</b>: Best + Second Best has a slightly higher testing accuracy
<br>Possible reason: logistic regression and SVM differ very little in AUC (less than 0.07%) in the ROC plot, and we used macro-averaging for multi-class classification
<br>$\;\;\;\;\;\;$ → The binarized per-class results should be weighted by their sample counts to account correctly for the imbalance of the sample numbers</font>
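
To make the averaging point concrete, the sketch below contrasts a macro average (every class counts equally) with a support-weighted average (every class counts in proportion to its sample size). The per-class AUC values and supports are invented for illustration, and the helper names are assumptions. scikit-learn's `roc_auc_score` exposes the same choice through its `average` parameter (`'macro'` or `'weighted'`).

```python
def macro_average(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores.values()) / len(scores)

def weighted_average(scores, support):
    """Mean of the per-class scores, weighted by the number of samples in each class."""
    total = sum(support[c] for c in scores)
    return sum(scores[c] * support[c] for c in scores) / total

# Hypothetical per-class AUC values and class sizes (not the project's results).
auc_by_class = {"large": 0.90, "medium": 0.80, "small": 0.50}
support = {"large": 70, "medium": 25, "small": 5}

macro = macro_average(auc_by_class)
weighted = weighted_average(auc_by_class, support)
print(round(macro, 4), round(weighted, 4))
```

With imbalanced classes the two averages can rank models differently, so a model that looks best under macro-averaged AUC is not guaranteed to give the best overall test accuracy.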

### 2. How about applying PCA to the entire dataset?
<font size="4.5"> We ran a Multiple Correspondence Analysis (MCA) on the entire dataset
1. Half of the principal components are needed to explain 99% of the variance of the binary data (X)
2. Using these reduced features, we fit a logistic regression on the entire dataset
<br>$\;\;\;\;\;\;$ → The test accuracy was 0.73, which is lower than that of the two-step approach
<br>$\;\;\;\;\;\;$ → For the large binary recipe dataset, feature extraction does not reduce the feature space efficiently</font>
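
The "how many components for 99% of the variance" criterion can be sketched as a cumulative sum over the per-component variances. The function name `components_needed` and the toy variances are assumptions for illustration. In the notebook these values would come from the MCA, which is not reproduced here.

```python
def components_needed(explained_variances, threshold=0.99):
    """Smallest k such that the k largest components explain at least `threshold` of the variance."""
    total = sum(explained_variances)
    cumulative = 0.0
    ordered = sorted(explained_variances, reverse=True)
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical per-component variances with a slowly decaying tail.
toy_variances = [8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.125]

k_99 = components_needed(toy_variances, threshold=0.99)
k_90 = components_needed(toy_variances, threshold=0.90)
print(k_99, k_90, len(toy_variances))
```

When the variance is spread over many components, as reported above for the recipe matrix, the 99% threshold keeps a large share of the dimensions and the reduction saves little.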

# 4. Conclusion

<font size="4.5"> 1. We achieved a test accuracy of 0.847 in classifying the cuisines with the two-step classifiers 
<br><br>2. We believe this project can benefit people who love food and want to learn about the culture behind different recipes 
<br><br>3. We hope our model can serve as a useful resource for people interested in discovering the relationship between cuisines and lifestyles across regions and cultures</font>

<br><br><br>
# References

__Data Source:__ <br>

https://www.kaggle.com/kaggle/recipe-ingredients-dataset#test.json

__Publication:__<br>
1. Kotsiantis SB. Supervised Machine Learning: A Review of Classification Techniques. Proceedings of the 2007 conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies; 2007.
2. Vapnik V, Kotz S. Estimation of Dependences Based on Empirical Data. Springer; 2006.
3. Vapnik V. The Nature of Statistical Learning Theory. Springer New York; 1999.
4. Jonathon Shlens. A Tutorial on Principal Component Analysis. arXiv:1404.1100v1; 2014.
5. Herve Abdi & Dominique Valentin. Multiple Correspondence Analysis; 2007.
6. Howard Bergman et al. Correspondence analysis is a useful tool to uncover the relationships among categorical variables; 2010.

__Others:__
<br>https://www.researchgate.net/post/Machine_learning_if_proportion_of_number_of_cases_in_different_class_in_training_set_matters
<br>https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/chawla2002.html
<br>http://www.bigendiandata.com/2017-06-27-Mapping_in_Jupyter/
<br>https://towardsdatascience.com/a-complete-guide-to-an-interactive-geographical-map-using-python-f4c5197e23e0
<br>https://bokeh.pydata.org/en/latest/docs/gallery/unemployment.html
<br>www.freeworldmaps.net

__Supplement:__
<br>https://nbviewer.jupyter.org/github/Youngjo-Kim/Geographic-Classification-of-Cuisine-Recipes/blob/master/MainCode-RecipeClassification_CX4240_Summer2019.ipynb