<h1>From Understanding to Preparation</h1>


## Introduction

In this lab, we will continue learning about the data science methodology, and focus on the **Data Understanding** and the **Data Preparation** stages.

## Objectives

After completing this lab, you will be able to:

* Understand Data 
* Prepare Data for analysis and inference


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    
1. [Recap](#0)<br>
2. [Data Understanding](#2)<br>
3. [Data Preparation](#4)<br>
</div>
<hr>


# Recap <a id="0"></a>


In the lab **From Requirements to Collection**, we learned that the data we need to answer the question developed in the business understanding stage, namely *can we automate the process of determining the cuisine of a given recipe?*, is readily available. A researcher named Yong-Yeol Ahn scraped tens of thousands of food recipes (cuisines and ingredients) from three different websites, namely:


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig3_allrecipes.png" width="500">
<div align="center">
www.allrecipes.com
</div>
<br/><br/>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig4_epicurious.png" width="500">
<div align="center">
www.epicurious.com
</div>
<br/><br/>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig5_menupan.png" width="500">
<div align="center">
www.menupan.com
</div>
<br/><br/>


For more information on Yong-Yeol Ahn and his research, you can read his paper on [Flavor Network and the Principles of Food Pairing](http://yongyeol.com/papers/ahn-flavornet-2011.pdf).


# Data Understanding <a id="2"></a>


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%203/images/flowchart_data_understanding.png" width="500">


In [2]:
import pandas as pd

recipes = pd.read_csv('recipes.csv')

Show the first few rows.


In [3]:
recipes.head()

Unnamed: 0,country,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


Get the dimensions of the dataframe.


In [6]:
recipes.shape

(57691, 384)

So our dataset consists of 57,691 recipes. Each row represents a recipe. Of the 384 columns, one records the recipe's cuisine and the remaining 383 each indicate whether a given ingredient, from almond to zucchini, appears in the recipe.


We know that a basic sushi recipe includes the ingredients:
* rice
* soy sauce
* wasabi
* some fish/vegetables


Let's check that these ingredients exist in our dataframe:


In [7]:
ingredients = recipes.columns
ind = recipes.columns.str.contains('rice')

print(ingredients[ind])

Index(['brown_rice', 'licorice', 'rice'], dtype='object')


In [24]:
ingredients = recipes.columns

checklist = ['rice', 'soy', 'wasabi']
for c in checklist:
    ind = ingredients.str.contains(c)
    print(ingredients[ind])

Index(['brown_rice', 'licorice', 'rice'], dtype='object')
Index(['soy_sauce', 'soybean', 'soybean_oil'], dtype='object')
Index(['wasabi'], dtype='object')


Yes, they do!

* rice exists as rice.
* wasabi exists as wasabi.
* soy exists as soy_sauce.

So maybe if a recipe contains all three ingredients (rice, wasabi, and soy_sauce), we can confidently say that the recipe is **Japanese**! Let's keep this in mind!
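Note that the substring search above also returned `licorice` for `rice`. The following is a minimal sketch, on a hypothetical five-column index built from names printed in the outputs above, of the difference between a substring search and an exact lookup. The `seaweed` entry in the lookup list is included only to show a name that is absent from this toy index.

```python
import pandas as pd

# Hypothetical column index, using names printed in the outputs above.
columns = pd.Index(['brown_rice', 'licorice', 'rice', 'soy_sauce', 'wasabi'])

# Substring search: 'licorice' matches because it ends in "rice".
substring_hits = columns[columns.str.contains('rice')].tolist()

# Exact lookup: only a column named exactly like the query counts.
wanted = ['rice', 'soy_sauce', 'wasabi', 'seaweed']
exact_hits = [name for name in wanted if name in columns]

print(substring_hits)
print(exact_hits)
```

A substring search is handy for discovering how an ingredient is spelled; an exact lookup is the safer check once the spelling is known.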

----------------


# Data Preparation <a id="4"></a>


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%203/images/lab3_fig5_flowchart_data_preparation.png" width="500">


In this section, we will prepare the data for the next stage in the data science methodology, which is modeling. This stage involves exploring the data further and making sure that it is in the right format for the machine learning algorithm that we selected in the analytic approach stage, which is decision trees.


First, look at the data to see if it needs cleaning.


In [16]:
recipes['country'].value_counts()

country
American        40150
Mexico           1754
Italian          1715
Italy            1461
Asian            1176
                ...  
Indonesia          12
Belgium            11
East-African       11
Israel              9
Bangladesh          4
Name: count, Length: 69, dtype: int64

In [17]:
pd.crosstab(recipes['country'], 'count')

col_0,count
country,Unnamed: 1_level_1
African,115
American,40150
Asian,1176
Austria,21
Bangladesh,4
...,...
italian,74
japanese,99
korean,767
mexico,14


In [23]:
with pd.option_context('display.max_rows', 100):
    display(pd.crosstab(recipes['country'], 'count').sort_values('country'))

col_0,count
country,Unnamed: 1_level_1
African,115
American,40150
Asian,1176
Austria,21
Bangladesh,4
Belgium,11
Cajun_Creole,146
Canada,774
Caribbean,183
Central_SouthAmerican,241


By looking at the above table, we can make the following observations:

1. The cuisine column is labeled `country`, which is inaccurate.
2. Cuisine names are inconsistently capitalized: not all of them start with an uppercase letter.
3. Some cuisines appear twice, once under the country name and once under the adjective, such as Vietnam and Vietnamese.
4. Some cuisines have very few recipes.
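Observation 2 can also be checked programmatically rather than by eye. The sketch below uses a small hypothetical list of labels (not the lab's data) and groups the distinct spellings by their lowercase form; any group with more than one spelling is a capitalization collision.

```python
import pandas as pd

# Hypothetical labels for illustration only.
labels = pd.Series(['Italian'] * 3 + ['italian'] * 2 + ['Japanese', 'korean'])

# Distinct spellings, grouped by their lowercase form.
unique_labels = pd.Series(labels.unique())
spellings = unique_labels.groupby(unique_labels.str.lower()).agg(list)

# Keep only the lowercase forms that have more than one spelling.
collisions = spellings[spellings.map(len) > 1]
print(collisions.to_dict())
```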


#### Let's fix these problems.


Fix the name of the column showing the cuisine.


In [27]:
col_map = {'country': 'cuisine'}
recipes.rename(columns=col_map)

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57686,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57687,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57688,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57689,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


In [28]:
recipes = recipes.rename(columns=col_map)
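The two cells above differ in one important way: `rename` returns a new dataframe and leaves the original untouched, so the result has to be assigned back for the change to persist. A minimal sketch on a hypothetical two-row dataframe:

```python
import pandas as pd

# Hypothetical miniature of the recipes table.
df = pd.DataFrame({'country': ['Vietnamese', 'Japan'], 'rice': ['No', 'Yes']})

renamed = df.rename(columns={'country': 'cuisine'})  # returns a new DataFrame

print(list(df.columns))       # the original still has 'country'
print(list(renamed.columns))  # the copy has 'cuisine'
```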

Make all the cuisine names lowercase.


In [33]:
'Test'.lower()

'test'

In [35]:
str_lower = lambda s: s.lower()
recipes['cuisine'].map(str_lower)

0        vietnamese
1        vietnamese
2        vietnamese
3        vietnamese
4        vietnamese
            ...    
57686         japan
57687         japan
57688         japan
57689         japan
57690         japan
Name: cuisine, Length: 57691, dtype: object

In [36]:
recipes['cuisine'].str.lower()

0        vietnamese
1        vietnamese
2        vietnamese
3        vietnamese
4        vietnamese
            ...    
57686         japan
57687         japan
57688         japan
57689         japan
57690         japan
Name: cuisine, Length: 57691, dtype: object
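Both approaches give the same result here because the column has no missing values. As a hedged illustration on hypothetical data, the sketch below shows where they would differ: `.str.lower()` passes missing values through, while mapping a plain lambda raises an error on them.

```python
import pandas as pd

# Hypothetical series with one missing value.
names = pd.Series(['Vietnamese', 'Japan', None])

via_str = names.str.lower()  # the missing value stays missing

try:
    names.map(lambda s: s.lower())
    map_failed = False
except AttributeError:
    # A missing value has no .lower() method.
    map_failed = True

print(via_str.tolist()[:2], map_failed)
```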

In [38]:
recipes['cuisine'] = recipes['cuisine'].map(str_lower)
recipes

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57686,japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57687,japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57688,japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57689,japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


Make the cuisine names consistent.


In [73]:
name_map = {'austria': 'austrian',
            'belgium': 'belgian',
            'china': 'chinese',
            'canada': 'canadian',
            'netherlands': 'dutch',
            'france': 'french',
            'germany': 'german',
            'india': 'indian',
            'indonesia': 'indonesian',
            'iran': 'iranian',
            'israel': 'jewish',
            'italy': 'italian',
            'japan': 'japanese',
            'korea': 'korean',
            'lebanon': 'lebanese',
            'malaysia': 'malaysian',
            'mexico': 'maxican',
            'pakistan': 'pakistani',
            'philippines': 'philippine',
            'scandinavia': 'scandinavian',
            'spain': 'spanish_portuguese',
            'portugal': 'spanish_portuguese',
            'switzerland': 'swiss',
            'thailand': 'thai',
            'turkey': 'turkish',
            'irish': 'uk_irish',
            'uk-and-ireland': 'uk_irish',
            'vietnam': 'vietnamese',
            }
pd.crosstab(recipes['cuisine'].replace(name_map), 'N')

col_0,N
cuisine,Unnamed: 1_level_1
african,115
american,40150
asian,1193
austrian,21
bangladesh,4
belgian,11
cajun_creole,146
canadian,774
caribbean,183
central_southamerican,241


In [76]:
recipes['cuisine'] = recipes['cuisine'].replace(name_map)

Remove cuisines with 50 or fewer recipes:


In [92]:
t = recipes['cuisine'].value_counts()
t

cuisine
american                   40150
italian                     3250
maxican                     1768
french                      1264
asian                       1193
east_asian                   951
korean                       799
canadian                     774
mexican                      622
indian                       598
western                      450
chinese                      442
spanish_portuguese           416
uk_irish                     368
southern_soulfood            346
jewish                       329
japanese                     320
thai                         289
german                       289
mediterranean                289
scandinavian                 250
middleeastern                248
central_southamerican        241
eastern-europe               235
greek                        225
english_scottish             204
caribbean                    183
easterneuropean_russian      146
cajun_creole                 146
moroccan                     137
af
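The counts above list both `maxican` (1768) and `mexican` (622). The first label comes from the `'mexico': 'maxican'` entry in `name_map`, which looks like a misspelling of `mexican`, so the two groups are not merged. Below is a minimal sketch of how such near-duplicate labels could be flagged, assuming the standard-library `difflib` module and a hypothetical similarity cutoff of 0.85; the label list is a small subset of the names shown above.

```python
import difflib

# A few of the cuisine labels from the counts above.
labels = ['american', 'italian', 'maxican', 'french', 'mexican', 'korean']

near_duplicates = []
for i, label in enumerate(labels):
    # Compare each label only with the ones after it, so no pair is reported twice.
    matches = difflib.get_close_matches(label, labels[i + 1:], n=3, cutoff=0.85)
    near_duplicates.extend((label, m) for m in matches)

print(near_duplicates)
```

If the value in `name_map` were spelled `mexican`, the recipes from both groups would end up under a single label.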

In [105]:
filter_t = t[t>50]
filter_t

cuisine
american                   40150
italian                     3250
maxican                     1768
french                      1264
asian                       1193
east_asian                   951
korean                       799
canadian                     774
mexican                      622
indian                       598
western                      450
chinese                      442
spanish_portuguese           416
uk_irish                     368
southern_soulfood            346
jewish                       329
japanese                     320
thai                         289
german                       289
mediterranean                289
scandinavian                 250
middleeastern                248
central_southamerican        241
eastern-europe               235
greek                        225
english_scottish             204
caribbean                    183
easterneuropean_russian      146
cajun_creole                 146
moroccan                     137
af

In [111]:
filter_list = filter_t.index.tolist()
filter_list

['american',
 'italian',
 'maxican',
 'french',
 'asian',
 'east_asian',
 'korean',
 'canadian',
 'mexican',
 'indian',
 'western',
 'chinese',
 'spanish_portuguese',
 'uk_irish',
 'southern_soulfood',
 'jewish',
 'japanese',
 'thai',
 'german',
 'mediterranean',
 'scandinavian',
 'middleeastern',
 'central_southamerican',
 'eastern-europe',
 'greek',
 'english_scottish',
 'caribbean',
 'easterneuropean_russian',
 'cajun_creole',
 'moroccan',
 'african',
 'southwestern',
 'south-america',
 'vietnamese',
 'north-african']

In [123]:
recipes.loc[recipes['cuisine'].isin(filter_list)]

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57686,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57687,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57688,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57689,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


In [137]:
recipes = recipes.loc[recipes['cuisine'].isin(filter_list)].reset_index(drop=True)
recipes

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57398,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57399,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57400,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57401,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
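The filter above leaves 57,403 of the original 57,691 rows. The same kind of frequency filter can be written in one step with a group-wise transform. The sketch below compares both variants on hypothetical toy data; `min_recipes` is an illustrative threshold, not the lab's value of 50.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'cuisine': ['a', 'a', 'a', 'b', 'c', 'c'],
                   'rice': [True, False, True, False, True, True]})
min_recipes = 2  # illustrative threshold

# Approach used above: count, pick the frequent labels, filter with isin.
counts = df['cuisine'].value_counts()
keep = counts[counts > min_recipes].index
filtered_isin = df.loc[df['cuisine'].isin(keep)].reset_index(drop=True)

# One-step variant: attach each row's group size, then filter on it.
group_size = df.groupby('cuisine')['cuisine'].transform('size')
filtered_transform = df.loc[group_size > min_recipes].reset_index(drop=True)

print(filtered_isin.equals(filtered_transform))
```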


#### Let's analyze the data a little more to understand it better and note any interesting preliminary observations.


In [141]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57403 entries, 0 to 57402
Columns: 384 entries, cuisine to zucchini
dtypes: object(384)
memory usage: 168.2+ MB


In [138]:
recipes.describe()

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
count,57403,57403,57403,57403,57403,57403,57403,57403,57403,57403,...,57403,57403,57403,57403,57403,57403,57403,57403,57403,57403
unique,35,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
top,american,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
freq,40150,55097,57402,57180,57316,54981,57366,56783,57392,57390,...,57255,57033,55198,56672,56377,57370,57318,54018,56370,56301


In [146]:
recipes.isnull().sum().sum()

0

In [157]:
recipes['cuisine'] = recipes['cuisine'].astype('string')

In [185]:
ingred = recipes.columns.to_list()[1:]  # every column after 'cuisine' is an ingredient
conv_map = {'Yes': True, 'No': False}
for i in ingred:
    # Convert the 'Yes'/'No' strings to booleans
    recipes[i] = recipes[i].map(conv_map).astype('bool')

In [186]:
recipes

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,vietnamese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,vietnamese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,vietnamese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,vietnamese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,vietnamese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57398,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
57399,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
57400,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
57401,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [187]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57403 entries, 0 to 57402
Columns: 384 entries, cuisine to zucchini
dtypes: bool(383), string(1)
memory usage: 21.4 MB
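The reported memory usage dropped from 168.2+ MB to 21.4 MB because a boolean column stores one byte per row, whereas a column of Python strings stores a full string object per row. A minimal sketch on a hypothetical 1,000-row column:

```python
import pandas as pd

n = 1000  # hypothetical number of rows
df = pd.DataFrame({'rice': ['No', 'Yes'] * (n // 2)})

# deep=True also counts the memory held by the string values themselves.
as_text = df['rice'].memory_usage(index=False, deep=True)

converted = df['rice'].map({'Yes': True, 'No': False}).astype('bool')
as_bool = converted.memory_usage(index=False, deep=True)

print(as_text, as_bool)
```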


In [188]:
recipes.describe()

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
count,57403,57403,57403,57403,57403,57403,57403,57403,57403,57403,...,57403,57403,57403,57403,57403,57403,57403,57403,57403,57403
unique,35,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
top,american,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
freq,40150,55097,57402,57180,57316,54981,57366,56783,57392,57390,...,57255,57033,55198,56672,56377,57370,57318,54018,56370,56301


Run the following cell to get the recipes that contain **rice** *and* **soy sauce** *and* **wasabi** *and* **seaweed**.


In [211]:
recipes.loc[recipes['rice'] &
            recipes['soy_sauce'] &
            recipes['wasabi'] &
            recipes['seaweed']
           ]

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
11306,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
11321,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
11361,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
12171,asian,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
12385,asian,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13010,asian,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13159,asian,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13513,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13586,japanese,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13625,east_asian,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


Based on the results of the above code, can we classify all recipes that contain **rice** *and* **soy sauce** *and* **wasabi** *and* **seaweed** as **Japanese** recipes? Why?
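The rows returned above carry the labels `japanese`, `asian`, and `east_asian`, so one way to approach the question is to tabulate the cuisine labels of the matching recipes. The sketch below does this on hypothetical toy rows that only reuse the lab's column names.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the lab's dataframe.
toy = pd.DataFrame({
    'cuisine':   ['japanese', 'japanese', 'asian', 'east_asian', 'american'],
    'rice':      [True, True, True, True, True],
    'soy_sauce': [True, True, True, True, False],
    'wasabi':    [True, True, True, True, False],
    'seaweed':   [True, False, True, True, False],
})

# True only for rows where all four ingredients are present.
mask = toy[['rice', 'soy_sauce', 'wasabi', 'seaweed']].all(axis=1)

# How many matching recipes fall under each cuisine label.
label_counts = toy.loc[mask, 'cuisine'].value_counts()
print(label_counts)
```

If more than one label appears in the tally, the four ingredients alone do not single out one cuisine.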


Let's count the ingredients across all recipes.


In [246]:
n_ingred = len(ingred)
count_ingred = [0] * n_ingred
for i in range(n_ingred):
    ingred_name = ingred[i]
    count_ingred[i] = recipes[ingred_name].sum()

ingred_df = pd.DataFrame({'ingredient': ingred,
                         'count': count_ingred
                         })
ingred_df

Unnamed: 0,ingredient,count
0,almond,2306
1,angelica,1
2,anise,223
3,anise_seed,87
4,apple,2422
...,...,...
378,wood,33
379,yam,85
380,yeast,3385
381,yogurt,1033


Now we have a dataframe of ingredients and their total counts across all recipes. Let's sort this dataframe in descending order.


In [253]:
ingred_sorted_df = ingred_df.sort_values('count', ascending=False)
ingred_sorted_df

Unnamed: 0,ingredient,count
119,egg,21025
371,wheat,20781
51,butter,20719
230,onion,18080
135,garlic,17353
...,...,...
341,sturgeon_caviar,1
1,angelica,1
289,roasted_nut,1
215,muscat_grape,1


#### What are the 3 most popular ingredients?


In [254]:
ingred_sorted_df.head(3)

Unnamed: 0,ingredient,count
119,egg,21025
371,wheat,20781
51,butter,20719


However, note that there is a problem with the above table. About 40,000 of the 57,403 recipes in our dataset (roughly 70%) are American, so the overall counts are biased towards the ingredients of American recipes.


**Therefore**, let's compute a more objective summary of the ingredients by looking at the ingredients per cuisine.


#### Let's create a *profile* for each cuisine.

In other words, let's find out which ingredients are typical of each cuisine, for example Chinese or Canadian recipes.


In [321]:
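def find_ingredients(df):
    """Return per-ingredient counts and recipe fractions for the recipes in df."""
    n_recipes = df.shape[0]
    ingred = df.columns[1:]  # every column after 'cuisine'
    n_ingred = len(ingred)
    count_ingred = [0] * n_ingred
    for i in range(n_ingred):
        ingred_name = ingred[i]
        count_ingred[i] = df[ingred_name].sum()

    ingred_df = pd.DataFrame({'ingredient': ingred,
                             'count': count_ingred
                             })

    # Keep only the ingredients that are used (count > 0).
    # .copy() avoids a possible SettingWithCopyWarning when 'norm' is added.
    ingred_df = ingred_df.loc[ingred_df['count'] > 0].copy()
    # Fraction of the recipes in df that contain each ingredient.
    ingred_df['norm'] = ingred_df['count'] / n_recipes

    return ingred_df.sort_values('count', ascending=False)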

In [322]:
recipes_chinese = recipes.loc[recipes.cuisine == 'chinese']
find_ingredients(recipes_chinese)

Unnamed: 0,ingredient,count,norm
330,soy_sauce,303,0.685520
139,ginger,236,0.533937
135,garlic,234,0.529412
311,scallion,213,0.481900
316,sesame_oil,175,0.395928
...,...,...,...
225,oatmeal,1,0.002262
226,octopus,1,0.002262
275,raisin,1,0.002262
299,rum,1,0.002262


In [323]:
recipes_canadian = recipes.loc[recipes.cuisine == 'canadian']
find_ingredients(recipes_canadian)

Unnamed: 0,ingredient,count,norm
371,wheat,306,0.395349
51,butter,295,0.381137
119,egg,274,0.354005
230,onion,266,0.343669
135,garlic,209,0.270026
...,...,...,...
99,cognac,1,0.001292
262,porcini,1,0.001292
261,popcorn,1,0.001292
259,pistachio,1,0.001292


As shown above, `find_ingredients` returns a dataframe for a single cuisine in which each row is an ingredient, `count` is the number of that cuisine's recipes containing the ingredient, and `norm` is the corresponding fraction of that cuisine's recipes.

**For example**:

* *soy_sauce* is present in 68.55% of the **Chinese** recipes.
* *butter* is present in 38.11% of the **Canadian** recipes.
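The per-cuisine fractions can also be collected into one table with a row per cuisine and a column per ingredient. The following is a sketch on hypothetical toy data, not the lab's method: it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy data with two cuisines and two ingredients.
toy = pd.DataFrame({
    'cuisine':   ['chinese', 'chinese', 'canadian', 'canadian'],
    'soy_sauce': [True, True, False, False],
    'butter':    [False, True, True, False],
})

# One row per cuisine, one column per ingredient, values are fractions.
profiles = toy.groupby('cuisine').mean()
print(profiles)
```

From such a table, `profiles.loc['chinese'].nlargest(2)` would list the two most common ingredients of one cuisine.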


Let's print out the profile for each cuisine, with its ingredients sorted from the most to the least common.


In [344]:
cuisines = list(recipes['cuisine'].sort_values().unique())
for c in cuisines:
    print(f'Cuisine profile: {c.title()}')
    df = find_ingredients(recipes.loc[recipes.cuisine == c])
    display(df)
    print('-'*80)

Cuisine profile: African


Unnamed: 0,ingredient,count,norm
230,onion,61,0.530435
229,olive_oil,60,0.521739
135,garlic,57,0.495652
112,cumin,49,0.426087
68,cayenne,41,0.356522
...,...,...,...
304,sage,1,0.008696
306,salmon,1,0.008696
66,cauliflower,1,0.008696
64,cassava,1,0.008696


--------------------------------------------------------------------------------
Cuisine profile: American


Unnamed: 0,ingredient,count,norm
51,butter,16525,0.411582
119,egg,16266,0.405131
371,wheat,15996,0.398406
230,onion,11777,0.293325
209,milk,10680,0.266002
...,...,...,...
291,roasted_pecan,1,0.000025
341,sturgeon_caviar,1,0.000025
343,sunflower_oil,1,0.000025
165,katsuobushi,1,0.000025


--------------------------------------------------------------------------------
Cuisine profile: Asian


Unnamed: 0,ingredient,count,norm
330,soy_sauce,592,0.496228
139,ginger,580,0.486169
135,garlic,572,0.479464
284,rice,493,0.413244
311,scallion,456,0.382230
...,...,...,...
273,quince,1,0.000838
271,provolone_cheese,1,0.000838
114,currant,1,0.000838
123,endive,1,0.000838


--------------------------------------------------------------------------------
Cuisine profile: Cajun_Creole


Unnamed: 0,ingredient,count,norm
230,onion,102,0.698630
68,cayenne,82,0.561644
135,garlic,71,0.486301
51,butter,53,0.363014
364,vegetable_oil,50,0.342466
...,...,...,...
248,peanut_oil,1,0.006849
249,pear,1,0.006849
262,porcini,1,0.006849
294,romano_cheese,1,0.006849


--------------------------------------------------------------------------------
Cuisine profile: Canadian


Unnamed: 0,ingredient,count,norm
371,wheat,306,0.395349
51,butter,295,0.381137
119,egg,274,0.354005
230,onion,266,0.343669
135,garlic,209,0.270026
...,...,...,...
99,cognac,1,0.001292
262,porcini,1,0.001292
261,popcorn,1,0.001292
259,pistachio,1,0.001292


--------------------------------------------------------------------------------
Cuisine profile: Caribbean


Unnamed: 0,ingredient,count,norm
230,onion,94,0.513661
135,garlic,93,0.508197
34,black_pepper,57,0.311475
364,vegetable_oil,57,0.311475
354,tomato,55,0.300546
...,...,...,...
317,sesame_seed,1,0.005464
120,egg_noodle,1,0.005464
327,smoked_sausage,1,0.005464
331,soybean,1,0.005464


--------------------------------------------------------------------------------
Cuisine profile: Central_Southamerican


Unnamed: 0,ingredient,count,norm
135,garlic,137,0.568465
230,onion,131,0.543568
68,cayenne,125,0.518672
354,tomato,100,0.414938
103,corn,79,0.327801
...,...,...,...
70,celery_oil,1,0.004149
324,smoke,1,0.004149
144,grapefruit,1,0.004149
106,cottage_cheese,1,0.004149


--------------------------------------------------------------------------------
Cuisine profile: Chinese


Unnamed: 0,ingredient,count,norm
330,soy_sauce,303,0.685520
139,ginger,236,0.533937
135,garlic,234,0.529412
311,scallion,213,0.481900
316,sesame_oil,175,0.395928
...,...,...,...
225,oatmeal,1,0.002262
226,octopus,1,0.002262
275,raisin,1,0.002262
299,rum,1,0.002262


--------------------------------------------------------------------------------
Cuisine profile: East_Asian


Unnamed: 0,ingredient,count,norm
135,garlic,525,0.552050
330,soy_sauce,479,0.503680
311,scallion,471,0.495268
68,cayenne,453,0.476341
316,sesame_oil,373,0.392219
...,...,...,...
228,olive,1,0.001052
94,cocoa,1,0.001052
233,orange_juice,1,0.001052
91,citrus_peel,1,0.001052


--------------------------------------------------------------------------------
Cuisine profile: Eastern-Europe


Unnamed: 0,ingredient,count,norm
371,wheat,125,0.531915
119,egg,123,0.523404
51,butter,113,0.480851
230,onion,106,0.451064
209,milk,64,0.272340
...,...,...,...
63,cashew,1,0.004255
310,savory,1,0.004255
59,cardamom,1,0.004255
317,sesame_seed,1,0.004255


--------------------------------------------------------------------------------
Cuisine profile: Easterneuropean_Russian


Unnamed: 0,ingredient,count,norm
51,butter,88,0.602740
119,egg,74,0.506849
371,wheat,72,0.493151
230,onion,56,0.383562
109,cream,49,0.335616
...,...,...,...
301,rye_bread,1,0.006849
302,rye_flour,1,0.006849
303,saffron,1,0.006849
306,salmon,1,0.006849


--------------------------------------------------------------------------------
Cuisine profile: English_Scottish


Unnamed: 0,ingredient,count,norm
51,butter,137,0.671569
371,wheat,127,0.622549
119,egg,109,0.534314
109,cream,84,0.411765
209,milk,72,0.352941
...,...,...,...
116,dill,1,0.004902
316,sesame_oil,1,0.004902
258,pineapple,1,0.004902
323,shrimp,1,0.004902


--------------------------------------------------------------------------------
Cuisine profile: French


Unnamed: 0,ingredient,count,norm
51,butter,633,0.500791
119,egg,558,0.441456
371,wheat,472,0.373418
229,olive_oil,353,0.279272
109,cream,347,0.274525
...,...,...,...
250,pear_brandy,1,0.000791
54,cabernet_sauvignon_wine,1,0.000791
335,squid,1,0.000791
168,kiwi,1,0.000791


--------------------------------------------------------------------------------
Cuisine profile: German


Unnamed: 0,ingredient,count,norm
371,wheat,187,0.647059
119,egg,175,0.605536
51,butter,137,0.474048
230,onion,100,0.346021
209,milk,86,0.297578
...,...,...,...
249,pear,1,0.003460
248,peanut_oil,1,0.003460
328,sour_cherry,1,0.003460
331,soybean,1,0.003460


--------------------------------------------------------------------------------
Cuisine profile: Greek


Unnamed: 0,ingredient,count,norm
229,olive_oil,171,0.760000
135,garlic,100,0.444444
230,onion,82,0.364444
178,lemon_juice,76,0.337778
354,tomato,75,0.333333
...,...,...,...
331,soybean,1,0.004444
120,egg_noodle,1,0.004444
90,citrus,1,0.004444
92,clam,1,0.004444


--------------------------------------------------------------------------------
Cuisine profile: Indian


Unnamed: 0,ingredient,count,norm
112,cumin,361,0.603679
359,turmeric,304,0.508361
230,onion,298,0.498328
102,coriander,286,0.478261
68,cayenne,284,0.474916
...,...,...,...
262,porcini,1,0.001672
260,plum,1,0.001672
30,bitter_orange,1,0.001672
352,thai_pepper,1,0.001672


--------------------------------------------------------------------------------
Cuisine profile: Italian


Unnamed: 0,ingredient,count,norm
229,olive_oil,1970,0.606154
135,garlic,1709,0.525846
354,tomato,1275,0.392308
230,onion,1062,0.326769
18,basil,1014,0.312000
...,...,...,...
285,roasted_almond,1,0.000308
175,leaf,1,0.000308
170,kumquat,1,0.000308
157,huckleberry,1,0.000308


--------------------------------------------------------------------------------
Cuisine profile: Japanese


Unnamed: 0,ingredient,count,norm
330,soy_sauce,182,0.568750
284,rice,142,0.443750
365,vinegar,118,0.368750
364,vegetable_oil,112,0.350000
305,sake,89,0.278125
...,...,...,...
320,shellfish,1,0.003125
196,macaroni,1,0.003125
192,lobster,1,0.003125
190,litchi,1,0.003125


--------------------------------------------------------------------------------
Cuisine profile: Jewish


Unnamed: 0,ingredient,count,norm
119,egg,193,0.586626
371,wheat,162,0.492401
51,butter,102,0.310030
230,onion,99,0.300912
364,vegetable_oil,89,0.270517
...,...,...,...
185,lima_bean,1,0.003040
77,cherry,1,0.003040
25,beer,1,0.003040
342,sumac,1,0.003040


--------------------------------------------------------------------------------
Cuisine profile: Korean


Unnamed: 0,ingredient,count,norm
135,garlic,472,0.590738
311,scallion,419,0.524406
68,cayenne,413,0.516896
330,soy_sauce,394,0.493116
316,sesame_oil,346,0.433041
...,...,...,...
203,maple_syrup,1,0.001252
312,scallop,1,0.001252
88,cilantro,1,0.001252
87,cider,1,0.001252


--------------------------------------------------------------------------------
Cuisine profile: Maxican


Unnamed: 0,ingredient,count,norm
68,cayenne,1316,0.744344
230,onion,1259,0.712104
135,garlic,1131,0.639706
354,tomato,1099,0.621606
112,cumin,583,0.329751
...,...,...,...
226,octopus,1,0.000566
225,oatmeal,1,0.000566
326,smoked_salmon,1,0.000566
210,milk_fat,1,0.000566


--------------------------------------------------------------------------------
Cuisine profile: Mediterranean


Unnamed: 0,ingredient,count,norm
229,olive_oil,230,0.795848
135,garlic,146,0.505190
230,onion,112,0.387543
354,tomato,101,0.349481
241,parsley,87,0.301038
...,...,...,...
272,pumpkin,1,0.003460
302,rye_flour,1,0.003460
312,scallop,1,0.003460
328,sour_cherry,1,0.003460


--------------------------------------------------------------------------------
Cuisine profile: Mexican


Unnamed: 0,ingredient,count,norm
68,cayenne,441,0.709003
230,onion,378,0.607717
135,garlic,353,0.567524
354,tomato,305,0.490354
88,cilantro,251,0.403537
...,...,...,...
335,squid,1,0.001608
148,guava,1,0.001608
40,blue_cheese,1,0.001608
349,tarragon,1,0.001608


--------------------------------------------------------------------------------
Cuisine profile: Middleeastern


Unnamed: 0,ingredient,count,norm
229,olive_oil,149,0.600806
135,garlic,116,0.467742
371,wheat,94,0.379032
178,lemon_juice,89,0.358871
230,onion,88,0.354839
...,...,...,...
265,pork_sausage,1,0.004032
295,root,1,0.004032
310,savory,1,0.004032
312,scallop,1,0.004032


--------------------------------------------------------------------------------
Cuisine profile: Moroccan


Unnamed: 0,ingredient,count,norm
229,olive_oil,100,0.729927
112,cumin,75,0.547445
230,onion,68,0.496350
135,garlic,63,0.459854
89,cinnamon,60,0.437956
...,...,...,...
242,parsnip,1,0.007299
248,peanut_oil,1,0.007299
273,quince,1,0.007299
274,radish,1,0.007299


--------------------------------------------------------------------------------
Cuisine profile: North-African


Unnamed: 0,ingredient,count,norm
230,onion,33,0.550000
229,olive_oil,30,0.500000
112,cumin,29,0.483333
135,garlic,28,0.466667
354,tomato,26,0.433333
...,...,...,...
2,anise,1,0.016667
298,rosemary,1,0.016667
300,rutabaga,1,0.016667
210,milk_fat,1,0.016667


--------------------------------------------------------------------------------
Cuisine profile: Scandinavian


Unnamed: 0,ingredient,count,norm
51,butter,160,0.640
371,wheat,145,0.580
119,egg,132,0.528
109,cream,72,0.288
209,milk,56,0.224
...,...,...,...
211,mint,1,0.004
312,scallop,1,0.004
197,mace,1,0.004
196,macaroni,1,0.004


--------------------------------------------------------------------------------
Cuisine profile: South-America


Unnamed: 0,ingredient,count,norm
230,onion,44,0.427184
135,garlic,38,0.368932
119,egg,36,0.349515
209,milk,32,0.310680
253,pepper,26,0.252427
...,...,...,...
258,pineapple,1,0.009709
260,plum,1,0.009709
266,port_wine,1,0.009709
272,pumpkin,1,0.009709


--------------------------------------------------------------------------------
Cuisine profile: Southern_Soulfood


Unnamed: 0,ingredient,count,norm
51,butter,200,0.578035
371,wheat,168,0.485549
119,egg,144,0.416185
103,corn,103,0.297688
230,onion,100,0.289017
...,...,...,...
95,coconut,1,0.002890
329,sour_milk,1,0.002890
257,pimento,1,0.002890
331,soybean,1,0.002890


--------------------------------------------------------------------------------
Cuisine profile: Southwestern


Unnamed: 0,ingredient,count,norm
68,cayenne,88,0.814815
135,garlic,67,0.620370
230,onion,66,0.611111
88,cilantro,56,0.518519
229,olive_oil,43,0.398148
...,...,...,...
108,cranberry,1,0.009259
95,coconut,1,0.009259
327,smoked_sausage,1,0.009259
94,cocoa,1,0.009259


--------------------------------------------------------------------------------
Cuisine profile: Spanish_Portuguese


Unnamed: 0,ingredient,count,norm
229,olive_oil,241,0.579327
135,garlic,226,0.543269
230,onion,195,0.468750
27,bell_pepper,147,0.353365
354,tomato,142,0.341346
...,...,...,...
95,coconut,1,0.002404
314,seaweed,1,0.002404
317,sesame_seed,1,0.002404
328,sour_cherry,1,0.002404


--------------------------------------------------------------------------------
Cuisine profile: Thai


Unnamed: 0,ingredient,count,norm
135,garlic,173,0.598616
129,fish,153,0.529412
68,cayenne,136,0.470588
88,cilantro,121,0.418685
139,ginger,114,0.394464
...,...,...,...
150,ham,1,0.003460
164,kale,1,0.003460
50,buckwheat,1,0.003460
286,roasted_beef,1,0.003460


--------------------------------------------------------------------------------
Cuisine profile: Uk_Irish


Unnamed: 0,ingredient,count,norm
51,butter,219,0.595109
371,wheat,214,0.581522
119,egg,177,0.480978
209,milk,123,0.334239
230,onion,108,0.293478
...,...,...,...
191,liver,1,0.002717
186,lime,1,0.002717
3,anise_seed,1,0.002717
179,lemon_peel,1,0.002717


--------------------------------------------------------------------------------
Cuisine profile: Vietnamese


Unnamed: 0,ingredient,count,norm
129,fish,70,0.736842
135,garlic,69,0.726316
284,rice,47,0.494737
88,cilantro,41,0.431579
68,cayenne,41,0.431579
...,...,...,...
238,palm,1,0.010526
251,pecan,1,0.010526
264,pork_liver,1,0.010526
267,potato,1,0.010526


--------------------------------------------------------------------------------
Cuisine profile: Western


Unnamed: 0,ingredient,count,norm
119,egg,231,0.513333
371,wheat,208,0.462222
51,butter,207,0.460000
34,black_pepper,164,0.364444
230,onion,138,0.306667
...,...,...,...
203,maple_syrup,1,0.002222
339,strawberry_jam,1,0.002222
202,mango,1,0.002222
200,mandarin,1,0.002222


--------------------------------------------------------------------------------
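
In each profile above, `count` is consistent with the number of recipes of that cuisine that contain the ingredient, and `norm` with that count divided by the cuisine's total number of recipes (for example, 160 / 0.640 = 250 Scandinavian recipes). The following is a minimal sketch of that calculation in plain Python, under that assumption. The toy data and the function name `cuisine_profile` are hypothetical and are not part of the lab's code.

```python
from collections import Counter

# Hypothetical toy data: (cuisine, set of ingredients) pairs.
# These recipes are made up for illustration and are not from the lab's dataset.
toy_recipes = [
    ("italian", {"olive_oil", "garlic", "tomato"}),
    ("italian", {"olive_oil", "basil"}),
    ("italian", {"garlic", "onion"}),
    ("italian", {"olive_oil", "garlic"}),
    ("japanese", {"soy_sauce", "rice"}),
]


def cuisine_profile(recipes, cuisine):
    """Return (ingredient, count, norm) rows for one cuisine, most frequent first."""
    selected = [ingredients for name, ingredients in recipes if name == cuisine]
    if not selected:
        return []

    counts = Counter()
    for ingredients in selected:
        # Each ingredient is counted at most once per recipe.
        counts.update(ingredients)

    total = len(selected)
    # Sort by descending count, then by name so that ties have a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # norm is assumed to be the share of the cuisine's recipes using the ingredient.
    return [(ingredient, n, n / total) for ingredient, n in ranked]


profile = cuisine_profile(toy_recipes, "italian")
for ingredient, count, norm in profile:
    print(f"{ingredient},{count},{norm:.6f}")
```

Read this way, a `norm` close to 1 marks an ingredient used in almost every recipe of the cuisine, while the long tail of ingredients with `count` equal to 1 tells us little about the cuisine as a whole.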


In [345]:
recipes.to_pickle('recipes.pickle')

At this point, we have a good understanding of the data, and it is in the right format for modeling!
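
To pick the data up again in a later notebook, a pickled DataFrame can be read back with `pd.read_pickle`. Here is a minimal sketch of the round trip, using a small hypothetical DataFrame in place of `recipes` and a temporary directory so that no existing file is overwritten:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the lab's `recipes` DataFrame.
toy = pd.DataFrame(
    {"cuisine": ["italian", "japanese"], "garlic": [1, 0], "rice": [0, 1]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "recipes.pickle")
    toy.to_pickle(path)              # same call as in the lab, on the toy data
    restored = pd.read_pickle(path)  # load it back while the file still exists

# The restored DataFrame should match the original in values and dtypes.
print(restored.equals(toy))
```

In the lab itself, the equivalent call would be `pd.read_pickle('recipes.pickle')`, assuming the file sits in the notebook's working directory.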

-----------


### Update logs

| Commit                  | Date              | Author     |
|-------------------------|-------------------|------------|
| Initial commit          | 2023 JAN 06       | R. Promkam |

------