<h1>From Understanding to Preparation</h1>


## Introduction

In this lab, we will continue learning about the data science methodology, and focus on the **Data Understanding** and the **Data Preparation** stages.

## Objectives

After completing this lab, you will be able to:

* Understand Data 
* Prepare Data for analysis and inference


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    
1. [Recap](#0)<br>
2. [Data Understanding](#2)<br>
3. [Data Preparation](#4)<br>
</div>
<hr>


# Recap <a id="0"></a>


In Lab **From Requirements to Collection**, we learned that the data we need to answer the question developed in the business understanding stage, namely *can we automate the process of determining the cuisine of a given recipe?*, is readily available. A researcher named Yong-Yeol Ahn scraped tens of thousands of food recipes (cuisines and ingredients) from three different websites, namely:


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig3_allrecipes.png" width="500">
<div align="center">
www.allrecipes.com
</div>
<br/><br/>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig4_epicurious.png" width="500">
<div align="center">
www.epicurious.com
</div>
<br/><br/>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig5_menupan.png" width="500">
<div align="center">
www.menupan.com
</div>
<br/><br/>


For more information on Yong-Yeol Ahn and his research, you can read his paper on [Flavor Network and the Principles of Food Pairing](http://yongyeol.com/papers/ahn-flavornet-2011.pdf).


# Data Understanding <a id="2"></a>


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%203/images/flowchart_data_understanding.png" width="500">


In [None]:
import pandas as pd

recipes = pd.read_csv('recipes.csv')

Show the first few rows.


In [None]:
recipes.head()

Get the dimensions of the dataframe.


In [None]:
recipes.shape

So our dataset consists of 57,691 recipes. Each row represents a recipe. For each recipe, the dataset records the corresponding cuisine and whether each of 384 ingredients, beginning with almond and ending with zucchini, is present in the recipe.


We know that a basic sushi recipe includes the ingredients:
* rice
* soy sauce
* wasabi
* some fish/vegetables


Let's check that these ingredients exist in our dataframe:


In [None]:
## YOUR CODE HERE
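
One possible way to run this check is to search the column names for each keyword. The sketch below uses a small made-up dataframe (`toy`) in place of `recipes`, and the helper `find_columns` is a hypothetical name introduced for illustration, not a pandas function.

```python
import pandas as pd

# Made-up stand-in for `recipes`: only the column names matter here.
toy = pd.DataFrame(columns=["country", "almond", "brown_rice", "rice",
                            "soy_sauce", "soybean", "wasabi", "zucchini"])

def find_columns(df, keyword):
    """Return the column names that contain `keyword` as a substring."""
    return [col for col in df.columns if keyword in col]

for keyword in ["rice", "soy", "wasabi"]:
    print(keyword, "->", find_columns(toy, keyword))
```

A substring search is deliberately loose: it also surfaces related columns, so we can see under which exact name an ingredient is stored.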



Yes, they do!

* rice exists as rice.
* wasabi exists as wasabi.
* soy exists as soy_sauce.

So maybe if a recipe contains all three ingredients (rice, wasabi, and soy_sauce), then we can confidently say that it is a **Japanese** recipe! Let's keep this in mind!

----------------


# Data Preparation <a id="4"></a>


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%203/images/lab3_fig5_flowchart_data_preparation.png" width="500">


In this section, we will prepare the data for the next stage in the data science methodology, which is modeling. This stage involves exploring the data further and making sure that it is in the right format for the machine learning algorithm that we selected in the analytic approach stage, which is decision trees.


First, look at the data to see if it needs cleaning.


In [None]:
## YOUR CODE HERE
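
A frequency table of the label column is one simple way to spot problems. The sketch below assumes the label column is named `country` (the capitalization in the real file may differ) and uses made-up labels rather than the real data.

```python
import pandas as pd

# Made-up labels that mimic the kinds of inconsistency to look for.
toy = pd.DataFrame({"country": ["Vietnamese", "Vietnam", "vietnamese",
                                "Italian", "italian", "Italian"]})

# Count recipes per label; near-duplicate labels show up as separate rows.
label_counts = toy["country"].value_counts()
print(label_counts)
```

Labels that differ only in capitalization or wording appear as separate entries, and labels with very small counts stand out at the bottom of the table.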




By looking at the above table, we can make the following observations:

1. The cuisine column is labeled as Country, which is inaccurate.
2. Cuisine names are not consistent, as not all of them start with an uppercase first letter.
3. Some cuisines are duplicated as variations of the country name, such as Vietnam and Vietnamese.
4. Some cuisines have very few recipes.


#### Let's fix these problems.


Fix the name of the column showing the cuisine.


In [1]:
## YOUR CODE HERE
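
As a minimal sketch, `DataFrame.rename` can relabel a single column. The example assumes the existing column is named `country` and that `cuisine` is the desired new name; the toy rows are made up.

```python
import pandas as pd

# Made-up rows standing in for `recipes`.
toy = pd.DataFrame({"country": ["Vietnamese", "italian"],
                    "rice": ["Yes", "No"]})

# rename returns a new dataframe; the column order is preserved.
toy = toy.rename(columns={"country": "cuisine"})
print(toy.columns.tolist())
```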




Make all the cuisine names lowercase.


In [2]:
## YOUR CODE HERE
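
One way to do this is with the `.str.lower()` string accessor. The sketch assumes the label column has already been renamed to `cuisine` and uses made-up values.

```python
import pandas as pd

# Made-up labels with mixed capitalization.
toy = pd.DataFrame({"cuisine": ["Vietnamese", "vietnamese", "Italian"]})

# Lowercase every label so that case differences no longer split a cuisine.
toy["cuisine"] = toy["cuisine"].str.lower()
print(toy["cuisine"].unique())
```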



Make the cuisine names consistent.


In [None]:
name_map = {'austria': 'austrian',
            'belgium': 'belgian',
            'china': 'chinese',
            'canada': 'canadian',
            'netherlands': 'dutch',
            'france': 'french',
            'germany': 'german',
            'india': 'indian',
            'indonesia': 'indonesian',
            'iran': 'iranian',
            'israel': 'jewish',
            'italy': 'italian',
            'japan': 'japanese',
            'korea': 'korean',
            'lebanon': 'lebanese',
            'malaysia': 'malaysian',
            'mexico': 'mexican',
            'pakistan': 'pakistani',
            'philippines': 'philippine',
            'scandinavia': 'scandinavian',
            'spain': 'spanish_portuguese',
            'portugal': 'spanish_portuguese',
            'switzerland': 'swiss',
            'thailand': 'thai',
            'turkey': 'turkish',
            'irish': 'uk_irish',
            'uk-and-ireland': 'uk_irish',
            'vietnam': 'vietnamese',
            }

In [None]:
## YOUR CODE HERE
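
A mapping like this can be applied with `Series.replace`. The sketch below uses a small hypothetical subset of the mapping (`toy_map`) and made-up labels, and assumes the label column is named `cuisine`.

```python
import pandas as pd

# Small hypothetical subset of the mapping, for illustration only.
toy_map = {"vietnam": "vietnamese", "italy": "italian"}
toy = pd.DataFrame({"cuisine": ["vietnam", "vietnamese", "italy", "thai"]})

# Labels found in the mapping are replaced; all others are left unchanged.
toy["cuisine"] = toy["cuisine"].replace(toy_map)
print(toy["cuisine"].tolist())
```

`replace` is used here rather than `map` because `map` would turn labels missing from the dictionary into `NaN`.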



Remove cuisines with fewer than 50 recipes:


In [3]:
## YOUR CODE HERE
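
One possible approach is to count recipes per cuisine and keep only the rows whose cuisine meets the threshold. The helper name `drop_rare_cuisines` and the toy data are hypothetical; the column name `cuisine` is assumed.

```python
import pandas as pd

def drop_rare_cuisines(df, min_recipes=50):
    """Keep only rows whose cuisine appears at least `min_recipes` times."""
    counts = df["cuisine"].value_counts()
    keep = counts[counts >= min_recipes].index
    return df[df["cuisine"].isin(keep)]

# Made-up data: one common cuisine and one rare cuisine.
toy = pd.DataFrame({"cuisine": ["italian"] * 60 + ["thai"] * 3})
filtered = drop_rare_cuisines(toy)
print(filtered["cuisine"].value_counts())
```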



#### Let's analyze the data a little more to understand it better and note any interesting preliminary observations.


In [4]:
## YOUR CODE HERE



Run the following cell to get the recipes that contain **rice** *and* **soy** *and* **wasabi** *and* **seaweed**.


In [None]:
## YOUR CODE HERE
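
A filter like this can be sketched with a boolean mask. The example assumes the ingredient columns hold 0/1 flags and that the relevant columns are named `rice`, `soy_sauce`, `wasabi`, and `seaweed` (the last name is an assumption). The rows and cuisine labels are made up.

```python
import pandas as pd

# Made-up recipes with 0/1 ingredient flags.
toy = pd.DataFrame({
    "cuisine":   ["japanese", "east_asian", "italian"],
    "rice":      [1, 1, 1],
    "soy_sauce": [1, 1, 0],
    "wasabi":    [1, 1, 0],
    "seaweed":   [1, 1, 0],
})

wanted = ["rice", "soy_sauce", "wasabi", "seaweed"]

# A row matches only if every wanted ingredient flag equals 1.
mask = (toy[wanted] == 1).all(axis=1)
matches = toy[mask]
print(matches["cuisine"].value_counts())
```

Counting the cuisine labels of the matching rows shows whether the ingredient combination is unique to one cuisine.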



Based on the results of the above code, can we classify all recipes that contain **rice** *and* **soy** *and* **wasabi** *and* **seaweed** as **Japanese** recipes? Why?


Let's count the ingredients across all recipes.


In [None]:
## YOUR CODE HERE
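
As a minimal sketch, column sums give the number of recipes that use each ingredient, assuming the first column is the cuisine label and the remaining columns are 0/1 flags. The data and the name `ing_df` are made up.

```python
import pandas as pd

# Made-up recipes with 0/1 ingredient flags.
toy = pd.DataFrame({
    "cuisine": ["italian", "italian", "thai"],
    "butter":  [1, 1, 0],
    "egg":     [1, 0, 1],
    "rice":    [0, 0, 1],
})

# Sum each ingredient column (everything except the first, `cuisine`).
totals = toy.iloc[:, 1:].sum(axis=0)

# Collect the totals into a two-column dataframe.
ing_df = pd.DataFrame({"ingredient": totals.index, "count": totals.values})
print(ing_df)
```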



Now we have a dataframe of ingredients and their total counts across all recipes. Let's sort this dataframe in descending order.


In [None]:
## YOUR CODE HERE
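
Sorting can be sketched with `sort_values`. The column names `ingredient` and `count` and the numbers below are made up for illustration.

```python
import pandas as pd

# Made-up ingredient totals.
ing_df = pd.DataFrame({"ingredient": ["rice", "butter", "egg"],
                       "count": [1, 5, 3]})

# Highest counts first; reset_index gives a clean 0..n-1 ranking.
ing_sorted = (ing_df.sort_values("count", ascending=False)
                    .reset_index(drop=True))
print(ing_sorted)
```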



#### What are the 3 most popular ingredients?


In [None]:
## YOUR CODE HERE
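
If the table is already sorted, `head(3)` answers this directly. As an alternative that does not depend on a prior sort, `nlargest` can be used. The data below is made up.

```python
import pandas as pd

# Made-up ingredient totals, deliberately unsorted.
ing_df = pd.DataFrame({"ingredient": ["rice", "butter", "egg", "wheat"],
                       "count": [1, 5, 3, 4]})

# nlargest returns the rows with the 3 highest counts, in descending order.
top3 = ing_df.nlargest(3, "count")
print(top3)
```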



However, note that there is a problem with the above table. There are ~40,000 American recipes in our dataset, which means that the data is biased towards American ingredients.


**Therefore**, let's compute a more objective summary of the ingredients by looking at the ingredients per cuisine.


#### Let's create a *profile* for each cuisine.

In other words, let's try to find out which ingredients are typical of each cuisine, for example which ingredients Chinese recipes typically use and what characterizes Canadian food.


In [None]:
import numpy as np

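def find_ingredients(df):
    """Count how often each ingredient is used in df, normalized by the number of recipes."""
    n_recipes = df.shape[0]  # each row is one recipe
    ingred = df.columns[1:]  # skip the first column (the cuisine label)

    # Column sums give the number of recipes using each ingredient
    # (assumes the ingredient columns are numeric 0/1 flags).
    count_ingred = df[ingred].sum().values

    ingred_df = pd.DataFrame({'ingredient': ingred,
                              'count': count_ingred
                              })

    # Keep only ingredients that are actually used (count > 0).
    # .copy() avoids a SettingWithCopyWarning when adding the 'norm' column.
    ingred_df = ingred_df.loc[ingred_df['count'] > 0].copy()
    ingred_df['norm'] = ingred_df['count'] / n_recipes

    return ingred_df.sort_values('count', ascending=False)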

In [None]:
## YOUR CODE HERE
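
One way to build such a per-cuisine table is to group by cuisine and average the ingredient flags. This is a sketch under the assumption that the flags are 0/1 and the label column is named `cuisine`; the toy data is made up.

```python
import pandas as pd

# Made-up recipes with 0/1 ingredient flags.
toy = pd.DataFrame({
    "cuisine": ["italian", "italian", "thai", "thai"],
    "butter":  [1, 1, 0, 0],
    "rice":    [0, 1, 1, 1],
})

# The mean of a 0/1 column is the fraction of recipes with that ingredient.
profiles = toy.groupby("cuisine").mean()
print(profiles)
```

Here the cuisine ends up as the index; calling `reset_index()` would turn it into the first column instead. The group mean is the same quantity as a count divided by the number of recipes in that cuisine.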



As shown above, we have just created a dataframe where each row is a cuisine and each column (except for the first column) is an ingredient, and the row values represent the percentage of each ingredient in the corresponding cuisine.

**For example**:

* *soy_sauce* is present across 68.55% of all of the **Chinese** recipes.
* *butter* is present across 38.11% of all of the **Canadian** recipes.


Let's print out the profile for each cuisine by displaying the top four ingredients in each cuisine.


In [None]:
## YOUR CODE HERE
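
The profiles can be printed by picking the largest values in each row. The sketch below uses a made-up profile table indexed by cuisine, and `top_ingredients` is a hypothetical helper name.

```python
import pandas as pd

# Made-up profile table: rows are cuisines, values are ingredient fractions.
profiles = pd.DataFrame(
    {"butter": [0.6, 0.1], "egg": [0.5, 0.2], "rice": [0.1, 0.9],
     "garlic": [0.4, 0.7], "wheat": [0.7, 0.3]},
    index=["italian", "thai"],
)

def top_ingredients(profile_df, n=4):
    """Map each cuisine to its n highest-fraction ingredients."""
    return {cuisine: row.nlargest(n).index.tolist()
            for cuisine, row in profile_df.iterrows()}

for cuisine, names in top_ingredients(profiles).items():
    print(cuisine, names)
```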



At this point, we feel that we have understood the data well and the data is ready and is in the right format for modeling!

-----------


### Update logs

| Commit                  | Date              | Author     |
|-------------------------|-------------------|------------|
| Initial commit          | 2023 JAN 06       | R. Promkam |

------