<h1>From Understanding to Preparation</h1>


## Introduction

In this lab, we will continue learning about the data science methodology, and focus on the **Data Understanding** and the **Data Preparation** stages.

## Objectives

After complting this lab you will be able to:

* Understand Data 
* Prepare Data for analysis and inference


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    
1. [Recap](#0)<br>
2. [Data Understanding](#2)<br>
3. [Data Preparation](#2)<br>
</div>
<hr>


# Recap <a id="0"></a>


In Lab **From Requirements to Collection**, we learned that the data we need to answer the question developed in the business understanding stage, namely *can we automate the process of determining the cuisine of a given recipe?*, is readily available. A researcher named Yong-Yeol Ahn scraped tens of thousands of food recipes (cuisines and ingredients) from three different websites, namely:


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig3_allrecipes.png" width="500">
<div align="center">
www.allrecipes.com
</div>
<br/><br/>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig4_epicurious.png" width="500">
<div align="center">
www.epicurious.com
</div>
<br/><br/>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig5_menupan.png" width="500">
<div align="center">
www.menupan.com
</div>
<br/><br/>


For more information on Yong-Yeol Ahn and his research, you can read his paper on [Flavor Network and the Principles of Food Pairing](http://yongyeol.com/papers/ahn-flavornet-2011.pdf).


# Data Understanding <a id="2"></a>


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%203/images/flowchart_data_understanding.png" width="500">


In [1]:
import pandas as pd

recipes = pd.read_csv('recipes.csv')
recipes

Unnamed: 0,country,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57686,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57687,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57688,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57689,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


Show the first few rows.


In [2]:
recipes.head()

Unnamed: 0,country,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


Get the dimensions of the dataframe.


In [3]:
recipes.shape

(57691, 384)

So our dataset consists of 57,691 recipes. Each row represents a recipe, and for each recipe, the corresponding cuisine is documented as well as whether 384 ingredients exist in the recipe or not beginning with almond and ending with zucchini.


We know that a basic sushi recipe includes the ingredients:
* rice
* soy sauce
* wasabi
* some fish/vegetables


Let's check that these ingredients exist in our dataframe:


In [4]:
recipes.iloc[:,1:]

Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57686,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57687,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57688,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57689,No,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


In [5]:
## YOUR CODE HERE

ingredients = recipes.iloc[:,1:].columns
ingredients

Index(['almond', 'angelica', 'anise', 'anise_seed', 'apple', 'apple_brandy',
       'apricot', 'armagnac', 'artemisia', 'artichoke',
       ...
       'whiskey', 'white_bread', 'white_wine', 'whole_grain_wheat_flour',
       'wine', 'wood', 'yam', 'yeast', 'yogurt', 'zucchini'],
      dtype='object', length=383)

In [6]:
x = ingredients.str.contains('wasabi')
ingredients[x]

Index(['wasabi'], dtype='object')

Yes, they do!

* rice exists as rice.
* wasabi exists as wasabi.
* soy exists as soy_sauce.

So maybe if a recipe contains all three ingredients: rice, wasabi, and soy_sauce, then we can confidently say that the recipe is a **Japanese** cuisine! Let's keep this in mind!

----------------


# Data Preparation <a id="4"></a>


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%203/images/lab3_fig5_flowchart_data_preparation.png" width="500">


In this section, we will prepare the data for the next stage in the data science methodology, which is modeling. This stage involves exploring the data further and making sure that it is in the right format for the machine learning algorithm that we selected in the analytic approach stage, which is decision trees.


First, look at the data to see if it needs cleaning.


In [7]:
## YOUR CODE HERE

recipes.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57691 entries, 0 to 57690
Columns: 384 entries, country to zucchini
dtypes: object(384)
memory usage: 169.0+ MB


In [8]:
recipes.describe()

Unnamed: 0,country,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
count,57691,57691,57691,57691,57691,57691,57691,57691,57691,57691,...,57691,57691,57691,57691,57691,57691,57691,57691,57691,57691
unique,69,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
top,American,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
freq,40150,55362,57690,57467,57604,55258,57654,57065,57680,57678,...,57542,57319,55479,56957,56665,57658,57605,54289,56634,56584


In [9]:
recipes['country'].value_counts()

country
American        40150
Mexico           1754
Italian          1715
Italy            1461
Asian            1176
                ...  
Indonesia          12
Belgium            11
East-African       11
Israel              9
Bangladesh          4
Name: count, Length: 69, dtype: int64

In [10]:
with pd.option_context('display.max_row',100):
    print(recipes['country'].value_counts())

country
American                   40150
Mexico                      1754
Italian                     1715
Italy                       1461
Asian                       1176
French                       996
east_asian                   951
Canada                       774
korean                       767
Mexican                      622
western                      450
Southern_SoulFood            346
India                        324
Jewish                       320
Spanish_Portuguese           291
Mediterranean                289
UK-and-Ireland               282
Indian                       274
France                       268
MiddleEastern                248
Central_SouthAmerican        241
Germany                      237
Eastern-Europe               235
Chinese                      226
Greek                        225
English_Scottish             204
Caribbean                    183
Thai                         164
Scandinavia                  158
EasternEuropean_Russian      146
Ca

By looking at the above table, we can make the following observations:

1. Cuisine column is labeled as Country, which is inaccurate.
2. Cuisine names are not consistent as not all of them start with an uppercase first letter.
3. Some cuisines are duplicated as variation of the country name, such as Vietnam and Vietnamese.
4. Some cuisines have very few recipes.


#### Let's fixes these problems.


Fix the name of the column showing the cuisine.


In [11]:
## YOUR CODE HERE

col_map = {'country': 'cuisine'}
recipes.rename(columns=col_map, inplace=True)
recipes

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,Vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57686,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57687,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57688,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57689,Japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


Make all the cuisine names lowercase.


In [12]:
## YOUR CODE HERE

lower_case = lambda x: x.lower()

recipes['cuisine'] = recipes['cuisine'].map(lower_case)
recipes

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57686,japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57687,japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57688,japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57689,japan,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


Make the cuisine names consistent.


In [13]:
name_map = {'austria': 'austrian',
            'belgium': 'belgian',
            'china': 'chinese',
            'canada': 'canadian',
            'netherlands': 'dutch',
            'france': 'french',
            'germany': 'german',
            'india': 'indian',
            'indonesia': 'indonesian',
            'iran': 'iranian',
            'israel': 'jewish',
            'italy': 'italian',
            'japan': 'japanese',
            'korea': 'korean',
            'lebanon': 'lebanese',
            'malaysia': 'malaysian',
            'mexico': 'maxican',
            'pakistan': 'pakistani',
            'philippines': 'philippine',
            'scandinavia': 'scandinavian',
            'spain': 'spanish_portuguese',
            'portugal': 'spanish_portuguese',
            'switzerland': 'swiss',
            'thailand': 'thai',
            'turkey': 'turkish',
            'irish': 'uk_irish',
            'uk-and-ireland': 'uk_irish',
            'vietnam': 'vietnamese',
            }

In [14]:
## YOUR CODE HERE

recipes['cuisine'].replace(name_map, inplace=True)
recipes['cuisine'].value_counts()

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  recipes['cuisine'].replace(name_map, inplace=True)


cuisine
american                   40150
italian                     3250
maxican                     1768
french                      1264
asian                       1193
east_asian                   951
korean                       799
canadian                     774
mexican                      622
indian                       598
western                      450
chinese                      442
spanish_portuguese           416
uk_irish                     368
southern_soulfood            346
jewish                       329
japanese                     320
thai                         289
german                       289
mediterranean                289
scandinavian                 250
middleeastern                248
central_southamerican        241
eastern-europe               235
greek                        225
english_scottish             204
caribbean                    183
easterneuropean_russian      146
cajun_creole                 146
moroccan                     137
af

Remove cuisines with < 50 recipes:


In [15]:
## YOUR CODE HERE

x = recipes['cuisine'].value_counts()
y = x[x >= 50]
y

cuisine
american                   40150
italian                     3250
maxican                     1768
french                      1264
asian                       1193
east_asian                   951
korean                       799
canadian                     774
mexican                      622
indian                       598
western                      450
chinese                      442
spanish_portuguese           416
uk_irish                     368
southern_soulfood            346
jewish                       329
japanese                     320
thai                         289
german                       289
mediterranean                289
scandinavian                 250
middleeastern                248
central_southamerican        241
eastern-europe               235
greek                        225
english_scottish             204
caribbean                    183
easterneuropean_russian      146
cajun_creole                 146
moroccan                     137
af

In [16]:
x[x >= 50].to_dict()

{'american': 40150,
 'italian': 3250,
 'maxican': 1768,
 'french': 1264,
 'asian': 1193,
 'east_asian': 951,
 'korean': 799,
 'canadian': 774,
 'mexican': 622,
 'indian': 598,
 'western': 450,
 'chinese': 442,
 'spanish_portuguese': 416,
 'uk_irish': 368,
 'southern_soulfood': 346,
 'jewish': 329,
 'japanese': 320,
 'thai': 289,
 'german': 289,
 'mediterranean': 289,
 'scandinavian': 250,
 'middleeastern': 248,
 'central_southamerican': 241,
 'eastern-europe': 235,
 'greek': 225,
 'english_scottish': 204,
 'caribbean': 183,
 'easterneuropean_russian': 146,
 'cajun_creole': 146,
 'moroccan': 137,
 'african': 115,
 'southwestern': 108,
 'south-america': 103,
 'vietnamese': 95,
 'north-african': 60}

In [17]:
x[x >= 50].to_dict().keys()

dict_keys(['american', 'italian', 'maxican', 'french', 'asian', 'east_asian', 'korean', 'canadian', 'mexican', 'indian', 'western', 'chinese', 'spanish_portuguese', 'uk_irish', 'southern_soulfood', 'jewish', 'japanese', 'thai', 'german', 'mediterranean', 'scandinavian', 'middleeastern', 'central_southamerican', 'eastern-europe', 'greek', 'english_scottish', 'caribbean', 'easterneuropean_russian', 'cajun_creole', 'moroccan', 'african', 'southwestern', 'south-america', 'vietnamese', 'north-african'])

In [18]:
filter_list = list(x[x >= 50].to_dict().keys())
filter_list

['american',
 'italian',
 'maxican',
 'french',
 'asian',
 'east_asian',
 'korean',
 'canadian',
 'mexican',
 'indian',
 'western',
 'chinese',
 'spanish_portuguese',
 'uk_irish',
 'southern_soulfood',
 'jewish',
 'japanese',
 'thai',
 'german',
 'mediterranean',
 'scandinavian',
 'middleeastern',
 'central_southamerican',
 'eastern-europe',
 'greek',
 'english_scottish',
 'caribbean',
 'easterneuropean_russian',
 'cajun_creole',
 'moroccan',
 'african',
 'southwestern',
 'south-america',
 'vietnamese',
 'north-african']

#### Let's analyze the data a little more in order to learn the data better and note any interesting preliminary observations.


In [23]:
## YOUR CODE HERE
recipes.info()

recipes.describe()

recipes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57691 entries, 0 to 57690
Columns: 384 entries, cuisine to zucchini
dtypes: object(384)
memory usage: 169.0+ MB


Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
1,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
2,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
3,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
4,vietnamese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57686,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57687,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57688,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57689,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


In [25]:
## YOUR CODE HERE
recipes.loc[ (recipes['rice'] == 'Yes') 
    & (recipes['soy_sauce'] == 'Yes')
    & (recipes['wasabi'] == 'Yes')
    & (recipes['seaweed'] == 'Yes')
]


Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
11306,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
11321,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,Yes,No,No,No,No,No
11361,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
12171,asian,No,No,No,No,No,No,No,No,No,...,No,No,No,No,Yes,No,No,No,No,No
12385,asian,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
13010,asian,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
13159,asian,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
13513,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
13586,japanese,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
13625,east_asian,No,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


Run the following cell to get the recipes that contain **rice** *and* **soy** *and* **wasabi** *and* **seaweed**.


Based on the results of the above code, can we classify all recipes that contain **rice** *and* **soy** *and* **wasabi** *and* **seaweed** as **Japanese** recipes? Why?


Let's count the ingredients across all recipes.


In [45]:
## YOUR CODE HERE
ingredients 
recipes.loc[ recipes['almond'] == 'Yes' ]

Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
66,indian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
71,indian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,Yes,No
83,indian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
103,indian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
136,indian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57585,austrian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57586,austrian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57591,austrian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No
57600,austrian,Yes,No,No,No,No,No,No,No,No,...,No,No,No,No,No,No,No,No,No,No


Now we have a dataframe of ingredients and their total counts across all recipes. Let's sort this dataframe in descending order.


In [None]:
## YOUR CODE HERE



#### What are the 3 most popular ingredients?


In [None]:
## YOUR CODE HERE



However, note that there is a problem with the above table. There are ~40,000 American recipes in our dataset, which means that the data is biased towards American ingredients.


**Therefore**, let's compute a more objective summary of the ingredients by looking at the ingredients per cuisine.


#### Let's create a *profile* for each cuisine.

In other words, let's try to find out what ingredients Chinese people typically use, and what is Canadian food for example.


In [None]:
import numpy as np

def find_ingredients(df):
    all_ingred = df.shape[0]
    ingred = df.columns[1:]
    n_ingred = len(ingred)
    count_ingred = [0] * n_ingred
    for i in range(n_ingred):
        ingred_name = ingred[i]
        count_ingred[i] = df[ingred_name].sum()

    ingred_df = pd.DataFrame({'ingredient': ingred,
                             'count': count_ingred
                             })
    
    #   Only ingredients used  (count > 0)  
    ingred_df = ingred_df.loc[ingred_df['count'] > 0]
    ingred_df['norm'] = ingred_df['count']/all_ingred
 
    return ingred_df.sort_values('count', ascending=False)

In [None]:
## YOUR CODE HERE



As shown above, we have just created a dataframe where each row is a cuisine and each column (except for the first column) is an ingredient, and the row values represent the percentage of each ingredient in the corresponding cuisine.

**For example**:

* *soy_sauce* is present across 68.55% of all of the **Chinese** recipes.
* *butter* is present across 38.11% of all of the **Canadian** recipes.


Let's print out the profile for each cuisine by displaying the top four ingredients in each cuisine.


In [None]:
## YOUR CODE HERE



At this point, we feel that we have understood the data well and the data is ready and is in the right format for modeling!

-----------


### Update logs

| Commit                  | Date              | Author     |
|-------------------------|-------------------|------------|
| Initial commit          | 2023 JAN 06       | R. Promkam |

------