# CUNY SPS MSDS – DATA 620 Final Project – Recipe Analysis

### Betsy Rosalen and Mikhail Groysman

## Project Overview


### Final Project 

Your project should incorporate one or both of the two main themes of this course: network analysis and text processing. You need to show all of your work in a coherent workflow, and in a reproducible format, such as an IPython Notebook or an R Markdown document. If you are building a model or models, explain how you evaluate the “goodness” of the chosen model and parameters. 

### Final Project Presentation 

We’ll schedule a short presentation for each team, either in our last scheduled meet-up or in additional office hours to be scheduled during the last week of classes.

### Policy on Collaboration 

You may work in a team of up to three people. Each project team member is responsible for understanding and being able to explain all of the submitted project code. Remember that you can take work that you find elsewhere as a base to build on, but you need to acknowledge the source, so that I base your grade on what you contributed, not on what you started with!

## Recipe ingredients data

We chose a dataset that we found on the [Data Is Plural — Structured Archive](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0) that consists of 56,498 recipes from various cuisines, scraped from three popular recipe websites.

__Description from Data Is Plural__

> For their 2011 paper, “Flavor network and the principles of food pairing,” four scientists analyzed 56,498 recipes downloaded from three websites — allrecipes.com, epicurious.com, and menupan.com. To support their findings, the authors published two datasets. One names the cuisine and ingredients for each recipe. The other dataset counts how often any two ingredients appeared in the same recipe. (Parmesan cheese and beef appeared together 93 times; starfruit and Algerian geranium oil just once.) Related: “food2vec – Augmented cooking with machine intelligence,” published last month. [h/t Rob Barry](http://rob-barry.com/).

The __original research article__, Flavor network and the principles of food pairing, can be found here: __[Flavor network and the principles of food pairing](http://www.nature.com/articles/srep00196)__  
The __additional related article__ cited above can be found here: __[food2vec – Augmented cooking with machine intelligence](https://jaan.io/food2vec-augmented-cooking-machine-intelligence)__  

__The data__ is easily downloaded in CSV format from the __[Electronic supplementary material](https://www.nature.com/articles/srep00196#Sec8)__ section of the Flavor network and the principles of food pairing research paper webpage.

__The data downloads consist of the following two files:__

- srep00196-s2.csv - counts of how many flavour compounds any two ingredients share
- srep00196-s3.csv - one record per recipe with the ingredients listed in columns

__Structure of the srep00196-s2 dataset:__

- The paired ingredients are listed in the first two columns, one per column. According to the Data Is Plural description, the third column counts the number of recipes, across all cuisines in the dataset, in which the pair appears together. Information about the cuisine for each pairing is not available in this file.

- However, there is some confusion about what this data actually represents, since a different source, [Recipes for learning](https://www.visibledata.co.uk/blog/2018/02/18/2018-02-18-recipes-for-learning/), suggests that the third column in fact represents the number of flavor compounds that the two ingredients share. As a result, we decided not to rely on this file for pair counts and instead created our own counts of common ingredient pairs, grouped by cuisine, from the other file.

__Structure of the srep00196-s3 dataset:__

- The type of cuisine is listed in the first column with the remaining columns containing one ingredient per column. There are 32 additional columns in the file, so the maximum number of ingredients for any one recipe is 32.
    - The cuisine categories include:
        - African
        - EastAsian
        - EasternEuropean
        - LatinAmerican
        - MiddleEastern
        - NorthAmerican
        - NorthernEuropean
        - SouthAsian
        - SoutheastAsian
        - SouthernEuropean
        - WesternEuropean

Significant data manipulation was necessary to reshape and analyze this dataset both as a text and as a network.  

## Loading Libraries

In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tokenize import word_tokenize

import networkx as nx
from networkx.algorithms import bipartite as bi

from scipy import stats
import math
import random
random.seed(250)

import matplotlib.pyplot as plt
%matplotlib inline

# jupyter setup
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_rows', 500)
plt.rcParams["figure.figsize"] = (15,12)

## Loading Data

In [40]:
file_name ='https://raw.githubusercontent.com/betsyrosalen/DATA_620_Web_Analytics/master/Final_Project_Data/srep00196-s3.csv'

# 'Cuisine' followed by 'ingred1' ... 'ingred32'
columns = ['Cuisine'] + ['ingred' + str(i) for i in range(1, 33)]

recipes = pd.read_csv(file_name, header=None, skiprows=4, names=columns, encoding='utf-8')

recipes.head()
print("There are "+str(recipes.shape[0])+ " recipes with a maximum of "+str(recipes.shape[1]-1)+" ingredients each.")

Unnamed: 0,Cuisine,ingred1,ingred2,ingred3,ingred4,ingred5,ingred6,ingred7,ingred8,ingred9,...,ingred23,ingred24,ingred25,ingred26,ingred27,ingred28,ingred29,ingred30,ingred31,ingred32
0,African,chicken,cinnamon,soy_sauce,onion,ginger,,,,,...,,,,,,,,,,
1,African,cane_molasses,ginger,cumin,garlic,tamarind,bread,coriander,vinegar,onion,...,,,,,,,,,,
2,African,butter,pepper,onion,cardamom,cayenne,ginger,cottage_cheese,garlic,brassica,...,,,,,,,,,,
3,African,olive_oil,pepper,wheat,beef,onion,cardamom,cumin,garlic,rice,...,,,,,,,,,,
4,African,honey,wheat,yeast,,,,,,,...,,,,,,,,,,


There are 56498 recipes with a maximum of 32 ingredients each.


## Create a Corpus from the Data

First we have to figure out how to get all the ingredients into one string that can be written to a file to create each text in our corpus. After some trial and error we got the following code to do what we needed.

In [21]:
for index, r in recipes.head().iterrows():
    string = ""
    for col in r.iloc[1:]:  # every ingredient column, skipping 'Cuisine'
        if isinstance(col, str):  # empty cells are NaN (a float), so skip them
            string = string + col + " "
    print(string)

chicken cinnamon soy_sauce onion ginger 
cane_molasses ginger cumin garlic tamarind bread coriander vinegar onion beef cayenne parsley wheat_bread yogurt vegetable_oil egg 
butter pepper onion cardamom cayenne ginger cottage_cheese garlic brassica 
olive_oil pepper wheat beef onion cardamom cumin garlic rice leek 
honey wheat yeast 
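The same concatenation can also be sketched without an explicit inner loop. The example below is a minimal sketch on a small made-up frame; `toy` and `join_ingredients` are illustrative names that are not part of the notebook. Unlike the loop above, it leaves no trailing space.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `recipes`: a cuisine column followed by ingredient columns
toy = pd.DataFrame({
    'Cuisine': ['African', 'African'],
    'ingred1': ['chicken', 'honey'],
    'ingred2': ['cinnamon', 'wheat'],
    'ingred3': ['onion', np.nan],
})

def join_ingredients(row):
    """Join the non-missing ingredients of one recipe with single spaces."""
    return " ".join(row.dropna())

# One string per recipe, built from the ingredient columns only
bodies = toy.drop(columns='Cuisine').apply(join_ingredients, axis=1)
print(bodies.tolist())
```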


Now we can use a [function we found on StackOverflow](https://stackoverflow.com/questions/49088978/how-to-create-corpus-from-pandas-data-frame-to-operate-with-nltk/49104725) and modify it to create our corpus.  I am putting this code in a markdown cell so that it doesn't run again each time we run the notebook. Copy the code into a code cell or change the cell to a code cell to run it to create your text files if you want to reproduce this analysis.

```
# Adapted from https://stackoverflow.com/questions/49088978/how-to-create-corpus-from-pandas-data-frame-to-operate-with-nltk/49104725
def CreateCorpusFromDataFrame(corpusfolder, df):
    for index, r in df.iterrows():
        title = 'recipe' + str(index)
        # join every non-empty ingredient column (skipping 'Cuisine')
        body = ""
        for col in r.iloc[1:]:
            if isinstance(col, str):
                body = body + col + " "
        cuisine = r['Cuisine']
        fname = str(cuisine) + '_' + title + '.txt'
        # 'w' overwrites, so re-running does not append duplicate text
        with open(corpusfolder + '/' + fname, 'w') as corpusfile:
            corpusfile.write(body + " " + title)

CreateCorpusFromDataFrame('Final_Project_corpusfolder', recipes)
```

Finally we can run the following code to create our corpus in NLTK.

In [29]:
my_corpus=CategorizedPlaintextCorpusReader('Final_Project_corpusfolder/',r'.*', cat_pattern=r'(.*)_.*')

In [30]:
my_corpus.fileids()

['.DS_Store',
 'African_recipe0.txt',
 'African_recipe1.txt',
 'African_recipe10.txt',
 'African_recipe100.txt',
 'African_recipe101.txt',
 'African_recipe102.txt',
 'African_recipe103.txt',
 'African_recipe104.txt',
 'African_recipe105.txt',
 'African_recipe106.txt',
 'African_recipe107.txt',
 'African_recipe108.txt',
 'African_recipe109.txt',
 'African_recipe11.txt',
 'African_recipe110.txt',
 'African_recipe111.txt',
 'African_recipe112.txt',
 'African_recipe113.txt',
 'African_recipe114.txt',
 'African_recipe115.txt',
 'African_recipe116.txt',
 'African_recipe117.txt',
 'African_recipe118.txt',
 'African_recipe119.txt',
 'African_recipe12.txt',
 'African_recipe120.txt',
 'African_recipe121.txt',
 'African_recipe122.txt',
 'African_recipe123.txt',
 'African_recipe124.txt',
 'African_recipe125.txt',
 'African_recipe126.txt',
 'African_recipe127.txt',
 'African_recipe128.txt',
 'African_recipe129.txt',
 'African_recipe13.txt',
 'African_recipe130.txt',
 'African_recipe131.txt',
 'Afri

In [32]:
my_corpus.categories()

['.DS',
 'African',
 'EastAsian',
 'EasternEuropean',
 'LatinAmerican',
 'MiddleEastern',
 'NorthAmerican',
 'NorthernEuropean',
 'SouthAsian',
 'SoutheastAsian',
 'SouthernEuropean',
 'WesternEuropean']
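The `.DS_Store` file id and the `.DS` category in the output above come from a macOS metadata file that happens to match the catch-all `r'.*'` pattern. One way to keep it out is to use stricter patterns. The sketch below checks two candidate patterns with the standard `re` module before handing them to the reader; `FILEID_PATTERN`, `CAT_PATTERN`, the helper functions, and the sample file names are assumptions for illustration, and it assumes the reader matches each file name against the whole pattern.

```python
import re

# Hypothetical, stricter patterns (not taken from the original notebook)
FILEID_PATTERN = r'.+_recipe\d+\.txt'
CAT_PATTERN = r'(.+)_recipe\d+\.txt'

def matching_fileids(names, pattern=FILEID_PATTERN):
    """Return the names that the pattern matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]

def category_of(name, pattern=CAT_PATTERN):
    """Return the cuisine captured by the category pattern."""
    return re.match(pattern, name).group(1)

# Sample names: one stray metadata file and two corpus-style files
names = ['.DS_Store', 'African_recipe0.txt', 'EastAsian_recipe500.txt']
kept = matching_fileids(names)
cats = sorted({category_of(n) for n in kept})
print(kept)
print(cats)

# The patterns could then be passed to NLTK, for example:
# CategorizedPlaintextCorpusReader('Final_Project_corpusfolder/',
#                                  FILEID_PATTERN, cat_pattern=CAT_PATTERN)
```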

In [33]:
my_corpus.words(categories='African')

['chicken', 'cinnamon', 'soy_sauce', 'onion', 'ginger', ...]

In [38]:
my_corpus.sents(categories=['EasternEuropean', 'NorthernEuropean', 'SouthernEuropean', 'WesternEuropean'])

[['butter', 'onion', 'potato', 'haddock', 'black_pepper', 'parsley', 'celery', 'milk_fat', 'smoke', 'milk', 'cream', 'recipe2864'], ['butter', 'lemon_juice', 'wheat', 'yeast', 'apricot', 'milk_fat', 'egg', 'milk', 'recipe2865'], ...]
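Each text ends with its recipe title, so tokens such as `recipe2864` appear alongside the ingredients in the output above. If the corpus is used for word counts, those ID tokens would probably need to be filtered out first. A minimal sketch, where `drop_recipe_ids` is a hypothetical helper rather than part of the notebook:

```python
import re

# Matches the 'recipe<number>' title token written at the end of each text
RECIPE_ID = re.compile(r'recipe\d+')

def drop_recipe_ids(tokens):
    """Keep only ingredient tokens, dropping 'recipeNNN' identifiers."""
    return [t for t in tokens if not RECIPE_ID.fullmatch(t)]

# Second sentence from the output above
sent = ['butter', 'lemon_juice', 'wheat', 'yeast', 'apricot',
        'milk_fat', 'egg', 'milk', 'recipe2865']
ingredients = drop_recipe_ids(sent)
print(ingredients)
```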

### Step 3. Exploratory Data Analysis
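
The cells in this part of the notebook refer to two data frames, `ING` and `COU`, whose creation is not shown. Judging by the column names used below, `COU` appears to be the recipe table with columns `Cuisine`, `I1` ... `I32`, and `ING` the pair table with columns `Ing1`, `Ing2`, `NumRecipes`. The sketch below only illustrates that assumed layout on made-up rows; the helper `to_cou` and all toy values are hypothetical.

```python
import pandas as pd

def to_cou(recipes_df):
    """Rename columns to the assumed 'Cuisine', 'I1', 'I2', ... scheme."""
    n_ingredients = recipes_df.shape[1] - 1
    out = recipes_df.copy()
    out.columns = ['Cuisine'] + [f'I{i}' for i in range(1, n_ingredients + 1)]
    return out

# Made-up rows shaped like the recipe table loaded earlier
toy_recipes = pd.DataFrame(
    [['African', 'chicken', 'onion'], ['EastAsian', 'rice', None]],
    columns=['Cuisine', 'ingred1', 'ingred2'],
)
COU = to_cou(toy_recipes)

# Made-up row shaped like the assumed pair table
ING = pd.DataFrame({'Ing1': ['black_tea'], 'Ing2': ['whiskey'], 'NumRecipes': [70]})

print(COU.columns.tolist())
print(ING.columns.tolist())
```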

In [None]:
print(ING.shape)

In [None]:
print(COU.shape)

In [None]:
def EDA_DF(df):

    pd.options.display.float_format = '{:,.2f}'.format
    #EDA_DF 1. Getting file dimensions.
    print(df.shape)

    #EDA_DF 2. Looking at columns.
    print(df.columns.values)

    #EDA_DF 3. Summary statistics for the numeric columns.
    print(df.describe())

    #EDA_DF 4. Let's check variable types (info() prints its report directly).
    df.info()

    #EDA_DF 5. Let's see how many values are missing.
    print(df.isnull().sum())

    #EDA_DF 6. Let's see the first rows of the dataset.
    print(df.head())


EDA_DF(ING)

The number of shared compounds varies from 1 to 227, with an average of 9. Most pairs of ingredients share just one compound.

In [None]:
EDA_DF(COU)

The number of ingredients per recipe varies from 1 to 32.

#### 3.1 Flavour Compound Data

In [None]:
ING.sort_values(by='NumRecipes', ascending=False).head()

Two types of beer share the most flavour compounds.

In [None]:
ING['test'] = ING.apply(lambda x: x['Ing1'] in x['Ing2'], axis=1)

ING[ING['test']==True].head()

Many ingredients have similar names and are generally similar ingredients, such as pork and pork liver.

In [None]:
ING1=pd.melt(ING, id_vars=['NumRecipes'], value_vars=['Ing1', 'Ing2'])

ING1.head()

In [None]:
temp = ING1.value.value_counts()
temp.head()

Black tea has the most occurrences in the dataset: 989. Altogether we have 1,507 unique ingredients.
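
Here an "occurrence" is one row of the pair table in which the ingredient appears in either column, which is why the two ingredient columns are melted into one before counting. A minimal sketch of that step on made-up rows (the toy values are not from the dataset):

```python
import pandas as pd

# Toy pair table with the same column names as ING
toy = pd.DataFrame({
    'Ing1': ['black_tea', 'black_tea', 'pork'],
    'Ing2': ['whiskey', 'potato', 'pork_liver'],
    'NumRecipes': [70, 65, 40],
})

# Stack Ing1 and Ing2 into a single 'value' column
long = pd.melt(toy, id_vars=['NumRecipes'], value_vars=['Ing1', 'Ing2'])

counts = long['value'].value_counts()   # rows each ingredient appears in
n_unique = long['value'].nunique()      # distinct ingredients overall
print(counts.head())
print(n_unique)
```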

In [None]:
ING[(ING['Ing1']=='black_tea') | (ING['Ing2']=='black_tea')].head()

In [None]:
INGsub=ING[ING['NumRecipes']>134]

import networkx as nx
G=nx.from_pandas_edgelist(INGsub, 'Ing1', 'Ing2', 'NumRecipes')
pos = nx.circular_layout(G)

import matplotlib.pyplot as plt 

plt.rcParams["figure.figsize"] = (15,15) # set plot size

#weights = [math.log(edata['attr_dict'][200]) for f, t, edata in G0.edges(data=True)] # set weights

nx.draw(G, pos, with_labels=True,  node_size=50, 
        font_size=8,  edge_color="skyblue")

In [None]:
tea=ING[((ING['Ing1']=='black_tea') | (ING['Ing2']=='black_tea')) & (ING['NumRecipes']>60)]
tea.head()

In [None]:
G=nx.from_pandas_edgelist(tea, 'Ing1', 'Ing2', 'NumRecipes')
pos = nx.circular_layout(G)

plt.rcParams["figure.figsize"] = (15,15) # set plot size

#weights = [math.log(edata['attr_dict'][200]) for f, t, edata in G0.edges(data=True)] # set weights

nx.draw(G, pos, with_labels=True,  node_size=50, 
        font_size=9,  edge_color="skyblue")

To my surprise, black tea shares compounds with many diverse types of food, such as whiskey or mashed potato. None of them taste like tea to me :)

In [None]:
ING['sclog']=np.log(ING['NumRecipes'])
ING.head()

In [None]:
plt.rcParams["figure.figsize"] = (14,6) # set plot size
def myhist(df,mycolumns):
    
    from matplotlib.pyplot import figure
    figure(num=None, figsize=(20, 6), dpi=80, facecolor='w', edgecolor='k')

    for i in mycolumns: 
        df.hist(column=i, bins=50) 
        
myhist(ING,['NumRecipes'])

As we have already seen, most pairs of ingredients share only a few flavour compounds.

In [None]:
plt.rcParams["figure.figsize"] = (12,5) # set plot size
ING.boxplot(column='sclog')

In [None]:
plt.rcParams["figure.figsize"] = (20,7) 
temp.hist(bins=100)

In [None]:
temp1=np.log(temp)

temp1.hist(bins=100)

Some ingredients appear only once or twice in the dataset, but most appear much more often.

#### 3.2 Recipes Data

In [None]:
temp=32-COU.isnull().sum(axis=1)

temp.hist(bins=33)

The number of ingredients per recipe is right-skewed, and very few recipes have more than 15 ingredients.
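
The per-recipe count can also be computed by counting the non-missing ingredient cells directly, which avoids hard-coding the number 32. A minimal sketch on a made-up frame with the same column naming (`toy`, `n_ingredients`, and `share_long` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy stand-in for COU with three ingredient columns
toy = pd.DataFrame({
    'Cuisine': ['African', 'EastAsian', 'African'],
    'I1': ['honey', 'rice', 'beef'],
    'I2': ['wheat', np.nan, 'onion'],
    'I3': ['yeast', np.nan, np.nan],
})

# Number of filled ingredient cells in each recipe
n_ingredients = toy.drop(columns='Cuisine').notna().sum(axis=1)

# Share of recipes above a chosen length (2 here; 15 in the text above)
share_long = (n_ingredients > 2).mean()
print(n_ingredients.tolist(), share_long)
```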

In [None]:
tempdf=temp.to_frame()

tempdf.head()

In [None]:
COU1=COU['Cuisine']

temp1=tempdf.join(COU, how='outer')

temp1.groupby('Cuisine').count()

North American recipes make up a disproportionate share of the dataset, 41,524 out of 56,498; Northern European recipes are the most underrepresented, with only 250.
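
Expressed as percentages, the imbalance is easier to see. The short calculation below uses only the three counts quoted above:

```python
# Counts quoted in the text above
total = 56498
counts = {'NorthAmerican': 41524, 'NorthernEuropean': 250}

# Share of all recipes, in percent, rounded to one decimal place
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(shares)
```

North American recipes account for roughly 73.5% of the dataset, while Northern European recipes account for about 0.4%.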

In [None]:
temp1['Cuisine'].value_counts().plot(kind='bar')

In [None]:
COU2=pd.melt(COU, id_vars=['Cuisine'], value_vars=['I1', 'I2','I3','I4','I5','I6','I7','I8','I9','I10','I11','I12','I13','I14',
                                                  'I15','I16','I17','I18','I19','I20','I21','I22','I23','I24','I25','I26',
                                                  'I27','I28','I29','I30','I31','I32'])

COU2.head()

In [None]:
COU2=COU2.rename(columns={"value": "Ing"})

COU2.head()

In [None]:
temp = COU2.Ing.value_counts()
temp.head()

In [None]:
temp.tail()

We have 381 unique ingredients. Across all recipes, egg is the most popular ingredient, wheat is second, and butter is third. Durian, beech, and strawberry jam are each present in just one recipe.

In [None]:
np.log(temp).hist(bins=50)

If we take the log of ingredient occurrences, the distribution looks like a combination of normal and uniform.

In [None]:
temp1=COU2.groupby(['Cuisine','Ing']).count()

temp1.reset_index().head()

In [None]:
temp1.groupby('Cuisine').count()

North American recipes use 354 different ingredients, while Northern European recipes use only 175.

In [None]:
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
frames = []
for i in range(1,32):
    for j in range((i+1),33):
        temp=COU.iloc[:,[0,i,j]]
        temp.columns=['Cuisine','Ing1','Ing2']
        frames.append(temp.dropna())
pairs = pd.concat(frames, ignore_index=True)


In [None]:
pairs1=pairs.dropna()

pairs1.head()

In [None]:
pairs2=pairs1.reset_index()

pairs3=pairs2.groupby(['Cuisine','Ing1','Ing2'], as_index=False).count()

pairs3.head()

In [None]:
pairs3.shape

In [None]:
pairs3.sort_values(by='index', ascending=False).head(10)

Of all pairs, "wheat"-"egg" is the most frequent.
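
One caveat: the loop that builds `pairs` records each pair in column order, so if the same two ingredients appear in a different order in different recipes, they are counted as two separate pairs. The sketch below shows one possible way to count unordered pairs by sorting the ingredients within each recipe first. It is an alternative under that assumption, not the method used in this notebook, and `count_unordered_pairs` and the toy rows are hypothetical.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

def count_unordered_pairs(df):
    """Count (cuisine, ingredient_a, ingredient_b) with a < b alphabetically."""
    counts = Counter()
    for row in df.itertuples(index=False):
        cuisine, *cells = row
        # keep real ingredient names only; missing cells are None/NaN
        ingredients = sorted({c for c in cells if isinstance(c, str)})
        for a, b in combinations(ingredients, 2):
            counts[(cuisine, a, b)] += 1
    return counts

# Toy recipes in which wheat and egg appear in both orders
toy = pd.DataFrame(
    [['NorthAmerican', 'wheat', 'egg', None],
     ['NorthAmerican', 'egg', 'butter', 'wheat']],
    columns=['Cuisine', 'I1', 'I2', 'I3'],
)
pair_counts = count_unordered_pairs(toy)
print(pair_counts.most_common(3))
```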

In [None]:
myhist(pairs3,['index'])

Interestingly, most pairs have a much lower frequency.

In [None]:
pairs3['LogInd']=np.log(pairs3['index'])

myhist(pairs3,['LogInd'])

In [None]:
pairs4=pairs3[(pairs3['index']>600) & (pairs3['Cuisine']=='NorthAmerican')]
pairs4.shape

In [None]:
G=nx.from_pandas_edgelist(pairs4, 'Ing1', 'Ing2', 'index')
pos = nx.circular_layout(G)

In [None]:
plt.rcParams["figure.figsize"] = (15,15) # set plot size

#weights = [math.log(edata['attr_dict'][200]) for f, t, edata in G0.edges(data=True)] # set weights

nx.draw(G, pos, with_labels=True,  node_size=50, 
        font_size=11,  edge_color="skyblue")

Above is a graph of the most common ingredients of North American cuisine and their interactions.
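
The commented-out `weights` line in the plotting cells suggests that edge thickness was meant to reflect pair frequency. A minimal sketch of that idea follows, assuming the count is stored under the edge attribute `index` as in the cell above; the toy pair counts are made up.

```python
import math

import networkx as nx
import pandas as pd

# Toy pair counts with the same column names as pairs4
toy_pairs = pd.DataFrame({
    'Ing1': ['wheat', 'butter', 'egg'],
    'Ing2': ['egg', 'wheat', 'butter'],
    'index': [900, 800, 650],
})

G = nx.from_pandas_edgelist(toy_pairs, 'Ing1', 'Ing2', edge_attr='index')

# One width per edge, on a log scale so large counts do not dominate
widths = [math.log(w) for _, _, w in G.edges(data='index')]
print(widths)

# The widths could then be passed to the drawing call, for example:
# nx.draw(G, nx.circular_layout(G), with_labels=True, width=widths, edge_color="skyblue")
```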

### 4. Analysis


#### 4.1 Number of Ingredients by Cuisine.

In [None]:
temp=32-COU.isnull().sum(axis=1)
tempdf=temp.to_frame()
temp1=tempdf.join(COU, how='outer')

In [None]:
# column 0 holds the ingredient count; the remaining columns are text and cannot be averaged
temp1.groupby('Cuisine')[0].mean()

SoutheastAsian recipes have the most ingredients on average, while NorthernEuropean ones have the fewest.

In [None]:
plt.rcParams["figure.figsize"] = (14,6) # set plot size
temp1.groupby('Cuisine')[0].mean().plot.bar()

In [None]:
temp2=temp1.reset_index()


plt.rcParams["figure.figsize"] = (15,15) # set plot size
temp1[0].hist(by=temp1['Cuisine'],bins=25)

Histograms of number of ingredients by cuisine

#### 4.2 Venn Diagram


In [None]:
import matplotlib_venn as venn

from matplotlib_venn import venn3

In [None]:
VD=COU2[['Cuisine','Ing']]
VD1=VD.dropna()
afr=VD1[VD1['Cuisine']=='African']
ea=VD1[VD1['Cuisine']=='EastAsian']
ee=VD1[VD1['Cuisine']=='EasternEuropean']
la=VD1[VD1['Cuisine']=='LatinAmerican']
me=VD1[VD1['Cuisine']=='MiddleEastern']
na=VD1[VD1['Cuisine']=='NorthAmerican']
ne=VD1[VD1['Cuisine']=='NorthernEuropean']
sa=VD1[VD1['Cuisine']=='SouthAsian']
sea=VD1[VD1['Cuisine']=='SoutheastAsian']
se=VD1[VD1['Cuisine']=='SouthernEuropean']
we=VD1[VD1['Cuisine']=='WesternEuropean']

In [None]:
set1 = set(afr['Ing'])
set2 = set(ea['Ing'])
set3 = set(ee['Ing'])
set4 = set(la['Ing'])
set5 = set(me['Ing'])
set6 = set(na['Ing'])
set7 = set(ne['Ing'])
set8 = set(sa['Ing'])
set9 = set(sea['Ing'])
set10 = set(se['Ing'])
set11 = set(we['Ing'])

In [None]:
plt.rcParams["figure.figsize"] = (11,6) # set plot size
venn3([set1, set2, set3], ('African', 'EastAsian', 'EasternEuropean'))

Interestingly, within this three-way comparison African cuisine has 18 ingredients that the other two do not use, East Asian has 50, and Eastern European has only 15.

18 unique African ingredients:

In [None]:
set1.difference(set2).difference(set3)

Analysis of these 18 ingredients leads us to conclude that the data is very incomplete. Peach is used in East Asian cuisine, for instance, and sunflower oil is extremely popular in Eastern Europe, so it is not clear why these ingredients do not appear for those cuisines in the dataset.

50 unique East Asian ingredients:

In [None]:
set2.difference(set1).difference(set3)

Coming from Eastern Europe, I can point out that beef liver, grape, melon, oatmeal, and watermelon are all common ingredients there, so the reason for their exclusion is not clear.

15 unique Eastern European ingredients:

In [None]:
set3.difference(set1).difference(set2)

In [None]:
venn3([set4, set5, set6], ('LatinAmerican', 'MiddleEastern', 'NorthAmerican'))

North American cuisine has 69 ingredients that are not part of either Latin American or Middle Eastern cuisine. All 3 cuisines share 200 ingredients.

In [None]:
venn3([set7, set8, set9], ('NorthernEuropean', 'SouthAsian', 'SoutheastAsian'))

These 3 cuisines have a lot of unique ingredients. Northern European has 28 ingredients that it does not share with the other 2, while Southeast Asian has 27 and South Asian has 31.

In [None]:
venn3([set10, set11, set2], ('SouthernEuropean', 'WesternEuropean', 'EastAsian'))

Surprisingly, these 3 cuisines overlap heavily, even though East Asian cuisine does not use about 100 ingredients that the two European cuisines do.

#### 4.3 Unique Ingredients by Cuisine.

African cuisine does not have any ingredients that are not also part of the 10 other cuisines.

East Asian has 6 unique ingredients. It is not clear to me what raw beef means. If it means uncooked beef, then the French eat raw beef as well.

In [None]:
set2.difference(set1).difference(set3).difference(set4).difference(set5).difference(set6).difference(set7).difference(set8).difference(set9).difference(set10).difference(set11)

Eastern European, Latin American, Middle Eastern cuisines do not have unique ingredients.

North America has 20 unique ingredients. Again, I do not agree with the data. Jasmine tea is much more popular in East Asia than in North America. The same applies to mate, which is more popular in South America. Roasted hazelnut is also consumed in Europe and the Middle East, and carob is consumed widely in the Middle East. As a last straw, sturgeon caviar made the North American list but not the Eastern European one! And so on.

In [None]:
set6.difference(set1).difference(set2).difference(set3).difference(set4).difference(set5).difference(set7).difference(set8).difference(set9).difference(set10).difference(set11)

Southeast Asian unique ingredients are:

In [None]:
set9.difference(set1).difference(set2).difference(set3).difference(set4).difference(set5).difference(set6).difference(set7).difference(set8).difference(set10).difference(set11)

Southern European unique ingredients are:

In [None]:
set10.difference(set1).difference(set2).difference(set3).difference(set4).difference(set5).difference(set6).difference(set7).difference(set8).difference(set9).difference(set11)

Western European unique ingredients are (I am not sure that Jamaican people would agree that Jamaican rum is uniquely Western European):

In [None]:
set11.difference(set1).difference(set2).difference(set3).difference(set4).difference(set5).difference(set6).difference(set7).difference(set8).difference(set9).difference(set10)
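
The long `.difference(...)` chains above all follow one rule: take a cuisine's ingredient set and remove everything that appears in any other cuisine. That rule can be sketched once as a helper over a dictionary of sets. `unique_to`, `sets_by_cuisine`, and the toy rows are illustrative names and values, not part of the notebook.

```python
import pandas as pd

# Toy long-format table with the same column names as VD1
toy = pd.DataFrame({
    'Cuisine': ['African', 'African', 'EastAsian', 'EastAsian', 'NorthAmerican'],
    'Ing': ['egg', 'durian', 'egg', 'wheat', 'egg'],
})

# One ingredient set per cuisine
sets_by_cuisine = toy.groupby('Cuisine')['Ing'].apply(set).to_dict()

def unique_to(name, sets):
    """Ingredients of `name` that appear in no other cuisine."""
    others = set().union(*(s for k, s in sets.items() if k != name))
    return sets[name] - others

uniques = {name: unique_to(name, sets_by_cuisine) for name in sets_by_cuisine}
print(uniques)
```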

#### 4.4 Most Common Ingredients by Cuisine.

In [None]:
pd.set_option('display.max_rows', 400)
temp1=COU2.groupby(['Cuisine','Ing']).count()
temp1.sort_values(['Cuisine','variable'],ascending = False).groupby('Cuisine').head(10)

The most common ingredients differ by cuisine. For instance, the most common ingredients in Western European recipes are butter, egg, and wheat, while in African recipes they are olive oil, onion, and cumin.

Some observations:

- Butter is the most common ingredient in Western Europe, Northern Europe, North America, and Eastern Europe. However, it does not make the top 3 in any other cuisine. In Southern Europe it is # 10, and in the Middle East # 6.
- Olive oil is the # 1 ingredient in Southern Europe and Africa, and the 2nd most common in the Middle East. In Western Europe it is # 9.
- Garlic - number 1 in Southeast Asia, number 2 in Southern Europe and East Asia, number 3 in Latin America, # 4 in Africa, # 5 in South Asia and the Middle East, # 6 in North America, and # 8 in Western Europe and Eastern Europe. I think we can say that garlic is loved all around the world, except in Northern Europe.
- Cumin - number 1 in South Asia, number 3 in Africa.
- Wheat - number 1 in the Middle East, number 2 in Northern Europe, number 3 in Western Europe, Eastern Europe, and North America.
- Cayenne - number 1 in Latin America, number 3 in Southeast Asia.
- Soy Sauce - number 1 in East Asia.
- Fats - as already mentioned, the most common fat in Western Europe, Northern Europe, North America, and Eastern Europe is butter. In Southern Europe, Africa, and the Middle East, it is olive oil. In Southeast Asia, South Asia, and Latin America, it is generic vegetable oil. In East Asia, it is sesame oil.
- Most common protein - in Western Europe, Southern Europe, Northern Europe, North America, the Middle East, and Eastern Europe it is egg. In Southeast Asia, it is fish. In South Asia, it is yogurt. In Latin America, it is cheese. In East Asia, it is soybean. In Africa, it is chicken.
- Most common carbs - in Western Europe, Southern Europe, Northern Europe, North America, the Middle East, Eastern Europe, and Africa it is wheat. In Southeast Asia, South Asia, and East Asia it is rice. In Latin America, it is corn.
- Most common spice - in Western Europe, Northern Europe, North America, the Middle East, Eastern Europe, and Africa it is onion. In Southern Europe, Southeast Asia, and East Asia, it is garlic. In South Asia, it is cumin. In Latin America, it is cayenne.
- Most common vegetable - in Western Europe, Northern Europe, and Eastern Europe, it is potato. In Southern Europe, South Asia, North America, the Middle East, and Latin America, it is tomato. In Southeast Asia and Africa, it is bell pepper. In East Asia, it is carrot.
- The most common meat - 
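
Rankings such as "# 10 in Southern Europe" can be read off a per-cuisine rank column rather than by scanning the sorted table. The sketch below shows one way to compute such ranks on a made-up long-format table; the column names `n` and `rank` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy long-format table with the same column names as COU2
toy = pd.DataFrame({
    'Cuisine': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Ing': ['butter', 'butter', 'butter', 'egg', 'egg', 'wheat',
            'garlic', 'garlic', 'butter'],
})

# Occurrences of each ingredient within each cuisine
counts = toy.groupby(['Cuisine', 'Ing']).size().rename('n').reset_index()

# Rank 1 = most common ingredient in that cuisine; ties share the lower rank
counts['rank'] = (counts.groupby('Cuisine')['n']
                        .rank(method='min', ascending=False)
                        .astype(int))

# Rank of one ingredient in every cuisine where it appears
butter_rank = counts[counts['Ing'] == 'butter'].set_index('Cuisine')['rank'].to_dict()
print(butter_rank)
```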


#### 4.5 Most Common Ingredient Pairs by Cuisine

In [None]:
pairs3.sort_values(['Cuisine','index'],ascending = False).groupby('Cuisine').head(6)

Observations:

- Butter and Wheat - the most common combination in Western European and Northern European cuisines, and the second most common in North America and Eastern Europe. Butter and wheat are each also commonly cooked with egg.
- Olive Oil and Garlic - the most common combination in Southern Europe, and the second most common in the Middle East and Africa. Both olive oil and garlic are often cooked with tomatoes. Olive oil is also commonly cooked with cumin and onions.
- Garlic and Vegetable Oil - similar to Olive Oil and Garlic above; it is the most common combination in Southeast Asia.
- Cumin and Turmeric - these 2 spices are the most common combination in South Asia. Both are often cooked with coriander.
- Wheat and Egg - as already mentioned, this is a popular combination. It is the # 1 combination in North America, the Middle East, and Eastern Europe.
- Onion and Cayenne - spicy! The most common combination in Latin America. Onion is also commonly cooked with olive oil and tomato.
- Cayenne and Scallion - spicy again. The most common combination in East Asia.
- Olive Oil and Cumin - the most common combination in Africa.
- Fat + Carb Combo - a very common combo, as we have seen with Butter and Wheat; to some extent Egg and Wheat, Milk and Wheat, and Wheat and Cream can be included as well.
- Fat + Spice - a lot of combinations, such as Olive Oil and Garlic, Olive Oil and Onion, Olive Oil and Basil, Olive Oil and Cumin, Vegetable Oil and Garlic, Olive Oil and Parsley, Sesame Oil and Soy Sauce, Sesame Oil and Scallion, Sesame Oil and Garlic.
- Spice + Spice - extremely popular. Examples are Ginger and Garlic, Coriander and Cumin, Cumin and Turmeric (curry), Coriander and Turmeric, Onion and Cumin, Onion and Turmeric, Cayenne and Cumin, Onion and Cayenne, Onion and Garlic, Cayenne and Garlic, and many more.

As we already started to see, different cuisines use different ingredients and different approaches to food combinations.