<h1>From Understanding to Preparation</h1>


## Introduction

In this lab, we will continue learning about the data science methodology, and focus on the **Data Understanding** and the **Data Preparation** stages.

## Objectives

After completing this lab you will be able to:

* Understand Data 
* Prepare Data for analysis and inference


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    
1. [Recap](#0)<br>
2. [Data Understanding](#2)<br>
3. [Data Preparation](#4)<br>
</div>
<hr>


# Recap <a id="0"></a>


In the **From Requirements to Collection** lab, we learned that the data we need to answer the question developed in the business understanding stage, namely *can we automate the process of determining the cuisine of a given recipe?*, is readily available. A researcher named Yong-Yeol Ahn scraped tens of thousands of food recipes (cuisines and ingredients) from three different websites, namely:


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig3_allrecipes.png" width="500">
<div align="center">
www.allrecipes.com
</div>
<br/><br/>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig4_epicurious.png" width="500">
<div align="center">
www.epicurious.com
</div>
<br/><br/>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig5_menupan.png" width="500">
<div align="center">
www.menupan.com
</div>
<br/><br/>


For more information on Yong-Yeol Ahn and his research, you can read his paper on [Flavor Network and the Principles of Food Pairing](http://yongyeol.com/papers/ahn-flavornet-2011.pdf).


# Data Understanding <a id="2"></a>


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%203/images/flowchart_data_understanding.png" width="500">


In [4]:
import pandas as pd

recipes = pd.read_csv('recipes.csv')

Show the first few rows.


In [5]:
recipes.head(3)

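| Unnamed: 0 | country | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Vietnamese | No | No | No | No | No | No | No | No | No | ... | No | No | No | No | No | No | No | No | No | No |
| 1 | Vietnamese | No | No | No | No | No | No | No | No | No | ... | No | No | No | No | No | No | No | No | No | No |
| 2 | Vietnamese | No | No | No | No | No | No | No | No | No | ... | No | No | No | No | No | No | No | No | No | No |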


Get the dimensions of the dataframe.


In [6]:
recipes.shape

(57691, 384)

So our dataset consists of 57,691 recipes. Each row represents a recipe. Of the 384 columns, `Unnamed: 0` appears to be a leftover row index and `country` records the cuisine, so the remaining 382 columns each indicate whether a given ingredient, from almond to zucchini, is present in the recipe.


We know that a basic sushi recipe includes the ingredients:
* rice
* soy sauce
* wasabi
* some fish/vegetables


Let's check that these ingredients exist in our dataframe:


In [25]:
ingredients = recipes.columns
ind = recipes.columns.str.contains('rice')
ingredients[ind]

Index(['brown_rice', 'licorice', 'rice'], dtype='object')

In [24]:
ingredients = recipes.columns

checklist = ['rice', 'soy', 'wasabi']
for c in checklist:
    ind = ingredients.str.contains(c)
    print(ingredients[ind])

Index(['brown_rice', 'licorice', 'rice'], dtype='object')
Index(['soy_sauce', 'soybean', 'soybean_oil'], dtype='object')
Index(['wasabi'], dtype='object')


Yes, they do!

* rice exists as rice.
* wasabi exists as wasabi.
* soy exists as soy_sauce.
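
Note that `str.contains` does substring matching, which is why `licorice` shows up in a search for `rice`. As a minimal sketch, assuming only the column names printed above, an exact-name check can be written alongside the substring search. The variable names here are illustrative.

```python
import pandas as pd

# Column names copied from the output above; the real dataframe has many more.
columns = pd.Index(
    ["brown_rice", "licorice", "rice", "soy_sauce", "soybean", "soybean_oil", "wasabi"]
)

# Substring match: every column whose name contains "rice" anywhere.
substring_hits = columns[columns.str.contains("rice")].tolist()

# Exact match: keep only the names that are themselves column labels.
wanted = ["rice", "soy_sauce", "wasabi"]
exact_hits = [name for name in wanted if name in columns]

print(substring_hits)
print(exact_hits)
```

The exact check is the safer one when the goal is to confirm that a specific ingredient column exists.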

So maybe if a recipe contains all three ingredients (rice, wasabi, and soy_sauce), we can confidently say that it is a **Japanese** recipe! Let's keep this in mind!

----------------


# Data Preparation <a id="4"></a>


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%203/images/lab3_fig5_flowchart_data_preparation.png" width="500">


In this section, we will prepare the data for the next stage in the data science methodology, which is modeling. This stage involves exploring the data further and making sure that it is in the right format for the machine learning algorithm that we selected in the analytic approach stage, which is decision trees.


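First, look at the data to see if it needs cleaning. The code cells from here on are written in R and assume that `recipes` is available as an R data frame.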


In [None]:
base::table(recipes$country) # frequency table

By looking at the above table, we can make the following observations:

1. The cuisine column is labeled `country`, which is inaccurate.
2. Cuisine names are inconsistent: not all of them start with an uppercase letter.
3. Some cuisines are duplicated as variations of the country name, such as Vietnam and Vietnamese.
4. Some cuisines have very few recipes.
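
For readers following along with the pandas dataframe loaded earlier, a frequency table like the one above can be sketched with `value_counts`. The toy data below is invented for illustration and only mimics the kinds of inconsistencies listed above.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values invented for illustration).
toy = pd.DataFrame(
    {"country": ["Vietnamese", "Vietnam", "japanese", "Japanese", "Japanese"]}
)

# Count how often each label occurs, most frequent first.
counts = toy["country"].value_counts()
print(counts)
```

Because the labels are compared as exact strings, `Vietnam` and `Vietnamese`, or `japanese` and `Japanese`, are counted as separate cuisines until they are cleaned.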


#### Let's fix these problems.


Fix the name of the column showing the cuisine.


In [None]:
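# rename the column by name so the result does not depend on its position
colnames(recipes)[colnames(recipes) == "country"] <- "cuisine"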

Make all the cuisine names lowercase.


In [None]:
recipes$cuisine <- tolower(as.character(recipes$cuisine))

recipes

Make the cuisine names consistent.


In [None]:
recipes$cuisine[recipes$cuisine == "austria"] <- "austrian"
recipes$cuisine[recipes$cuisine == "belgium"] <- "belgian"
recipes$cuisine[recipes$cuisine == "china"] <- "chinese"
recipes$cuisine[recipes$cuisine == "canada"] <- "canadian"
recipes$cuisine[recipes$cuisine == "netherlands"] <- "dutch"
recipes$cuisine[recipes$cuisine == "france"] <- "french"
recipes$cuisine[recipes$cuisine == "germany"] <- "german"
recipes$cuisine[recipes$cuisine == "india"] <- "indian"
recipes$cuisine[recipes$cuisine == "indonesia"] <- "indonesian"
recipes$cuisine[recipes$cuisine == "iran"] <- "iranian"
recipes$cuisine[recipes$cuisine == "israel"] <- "jewish"
recipes$cuisine[recipes$cuisine == "italy"] <- "italian"
recipes$cuisine[recipes$cuisine == "japan"] <- "japanese"
recipes$cuisine[recipes$cuisine == "korea"] <- "korean"
recipes$cuisine[recipes$cuisine == "lebanon"] <- "lebanese"
recipes$cuisine[recipes$cuisine == "malaysia"] <- "malaysian"
recipes$cuisine[recipes$cuisine == "mexico"] <- "mexican"
recipes$cuisine[recipes$cuisine == "pakistan"] <- "pakistani"
recipes$cuisine[recipes$cuisine == "philippines"] <- "philippine"
recipes$cuisine[recipes$cuisine == "scandinavia"] <- "scandinavian"
recipes$cuisine[recipes$cuisine == "spain"] <- "spanish_portuguese"
recipes$cuisine[recipes$cuisine == "portugal"] <- "spanish_portuguese"
recipes$cuisine[recipes$cuisine == "switzerland"] <- "swiss"
recipes$cuisine[recipes$cuisine == "thailand"] <- "thai"
recipes$cuisine[recipes$cuisine == "turkey"] <- "turkish"
recipes$cuisine[recipes$cuisine == "irish"] <- "uk-and-irish"
recipes$cuisine[recipes$cuisine == "uk-and-ireland"] <- "uk-and-irish"
recipes$cuisine[recipes$cuisine == "vietnam"] <- "vietnamese"

recipes
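
The three cleaning steps so far (rename, lowercase, map country names to cuisine names) can also be sketched in pandas. This is a minimal illustration on invented toy data, and the mapping shows only a few of the pairs from the R cell above.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values invented for illustration).
toy = pd.DataFrame({"country": ["Vietnam", "Vietnamese", "JAPAN", "Italy", "irish"]})

# 1. Rename the label column by name rather than by position.
toy = toy.rename(columns={"country": "cuisine"})

# 2. Lowercase every cuisine name.
toy["cuisine"] = toy["cuisine"].str.lower()

# 3. Map country names onto cuisine names (subset of the pairs listed above).
country_to_cuisine = {
    "vietnam": "vietnamese",
    "japan": "japanese",
    "italy": "italian",
    "irish": "uk-and-irish",
}
toy["cuisine"] = toy["cuisine"].replace(country_to_cuisine)

print(toy["cuisine"].tolist())
```

Keeping the mapping in a single dictionary makes it easy to review and extend, compared with one assignment per country.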

Remove cuisines with < 50 recipes:


In [None]:
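# sort the table of cuisines in descending order
# (named cuisine_counts rather than t to avoid shadowing base R's t() function)
cuisine_counts <- sort(base::table(recipes$cuisine), decreasing = TRUE)

cuisine_counts

In [None]:
# get cuisines with >= 50 recipes
filter_list <- names(cuisine_counts[cuisine_counts >= 50])

filter_list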

In [None]:
before <- nrow(recipes) # number of rows of original dataframe
print(paste0("Number of rows of original dataframe is ", before))

recipes <- recipes[recipes$cuisine %in% filter_list,]

after <- nrow(recipes)
print(paste0("Number of rows of processed dataframe is ", after))

print(paste0(before - after, " rows removed!"))
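
The same filter can be sketched in pandas. The helper name `drop_rare_cuisines` and the toy data are hypothetical, and the example uses a threshold of 2 instead of 50 so the effect is visible on a handful of rows.

```python
import pandas as pd

def drop_rare_cuisines(df, min_recipes=50):
    """Keep only rows whose cuisine has at least `min_recipes` recipes."""
    counts = df["cuisine"].value_counts()
    keep = counts[counts >= min_recipes].index
    return df[df["cuisine"].isin(keep)]

# Toy stand-in for the cleaned dataframe (values invented for illustration).
toy = pd.DataFrame({"cuisine": ["thai", "thai", "thai", "swiss", "korean", "korean"]})

filtered = drop_rare_cuisines(toy, min_recipes=2)
print(len(toy) - len(filtered), "rows removed")
```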

Convert all of the columns into factors. This is to run the classification model later.


In [None]:
recipes[] <- lapply(recipes, as.factor)

recipes

In R, you can check the structure of your data using the **str** function. Let's check the structure of our dataframe **recipes**.


In [None]:
str(recipes)

#### Let's analyze the data a little more in order to learn the data better and note any interesting preliminary observations.


Run the following cell to get the recipes that contain **rice** *and* **soy_sauce** *and* **wasabi** *and* **seaweed**.


In [None]:
check_recipes <- recipes[
    recipes$rice == "Yes" &
    recipes$soy_sauce == "Yes" &
    recipes$wasabi == "Yes" &
    recipes$seaweed == "Yes",
]

check_recipes
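
A pandas version of this row filter might look like the sketch below. The toy rows are invented purely to show the mechanics and say nothing about what the real dataset contains.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe (values invented for illustration).
toy = pd.DataFrame({
    "cuisine":   ["japanese", "asian", "american", "east_asian"],
    "rice":      ["Yes", "Yes", "Yes", "Yes"],
    "soy_sauce": ["Yes", "Yes", "No", "Yes"],
    "wasabi":    ["Yes", "Yes", "No", "No"],
    "seaweed":   ["Yes", "Yes", "No", "Yes"],
})

required = ["rice", "soy_sauce", "wasabi", "seaweed"]

# True only for rows where every required ingredient is marked "Yes".
mask = (toy[required] == "Yes").all(axis=1)
check_recipes = toy[mask]

print(check_recipes)
```

Listing the required ingredients once and using `all(axis=1)` avoids chaining one comparison per ingredient.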

Based on the results of the above code, can we classify all recipes that contain **rice** *and* **soy_sauce** *and* **wasabi** *and* **seaweed** as **Japanese** recipes? Why?


<div align="right">
<a href="#p1" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p1" class="collapse">
No, because other recipes such as <strong>Asian</strong> and <strong>East_Asian</strong> recipes also contain these ingredients.
</div>


Let's count the ingredients across all recipes.


In [None]:
# for each column, count the rows equal to "Yes"; as a factor, "No" is level 1 and "Yes" is level 2
ingred <- unlist(
            lapply( recipes[, names(recipes)], function(x) sum(as.integer(x) == 2))
            )

# transpose the dataframe so that each row is an ingredient
ingred <- as.data.frame( t( as.data.frame(ingred) ))
                
ing_df <- data.frame("ingredient" = names(ingred), 
                     "count" = as.numeric(ingred[1,])
                    )[-1,]
                
ing_df

Now we have a dataframe of ingredients and their total counts across all recipes. Let's sort this dataframe in descending order.


In [None]:
ing_df_sort <- ing_df[order(ing_df$count, decreasing = TRUE),]
rownames(ing_df_sort) <- 1:nrow(ing_df_sort)

ing_df_sort
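
In pandas, counting and sorting can be sketched in one expression by comparing the ingredient columns to `"Yes"` and summing the resulting booleans. The toy data is invented for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe (values invented for illustration).
toy = pd.DataFrame({
    "cuisine": ["american", "american", "thai"],
    "egg":     ["Yes", "Yes", "No"],
    "butter":  ["Yes", "No", "No"],
    "rice":    ["No", "No", "Yes"],
})

# Every column except the label is an ingredient.
ingredient_cols = toy.columns.drop("cuisine")

# Each "Yes" counts as 1; sort so the most common ingredient comes first.
ing_counts = (toy[ingredient_cols] == "Yes").sum().sort_values(ascending=False)

print(ing_counts)
```

Comparing to the string `"Yes"` directly avoids relying on the internal ordering of factor levels.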

#### What are the 3 most popular ingredients?


<div align="right">
<a href="#p2" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p2" class="collapse">
1. Egg with <strong>21,025</strong> occurrences. <br>
2. Wheat with <strong>20,781</strong> occurrences. <br>
3. Butter with <strong>20,719</strong> occurrences. <br>
</div>


However, note that there is a problem with the above table. There are ~40,000 American recipes in our dataset, which means that the data is biased towards American ingredients.


**Therefore**, let's compute a more objective summary of the ingredients by looking at the ingredients per cuisine.


#### Let's create a *profile* for each cuisine.

In other words, let's find out which ingredients are typical of each cuisine, for example what characterizes **Chinese** or **Canadian** food.


In [None]:
# create a dataframe of the counts of ingredients by cuisine, normalized by the number of 
# recipes pertaining to that cuisine
by_cuisine_norm <- aggregate(recipes, 
                        by = list(recipes$cuisine), 
                        FUN = function(x) round(sum(as.integer(x) == 2)/
                                                length(as.integer(x)),4))
# remove the unnecessary column "cuisine"
by_cuisine_norm <- by_cuisine_norm[,-2]

# rename the first column into "cuisine"
names(by_cuisine_norm)[1] <- "cuisine"
                            
head(by_cuisine_norm)

As shown above, we have just created a dataframe where each row is a cuisine and each column (except for the first column) is an ingredient. Each value is the fraction of that cuisine's recipes that contain the ingredient, rounded to four decimal places, so 0.1565 means 15.65%.

**For example**:

* *almond* is present across 15.65% of all of the **African** recipes.
* *butter* is present across 38.11% of all of the **Canadian** recipes.
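
The normalization can be sketched in pandas as a group-wise mean of 0/1 indicators: the mean of a 0/1 column is the fraction of rows equal to 1. The toy data below is invented and does not reproduce the percentages quoted above.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe (values invented for illustration).
toy = pd.DataFrame({
    "cuisine": ["canadian", "canadian", "canadian", "canadian", "african", "african"],
    "butter":  ["Yes", "Yes", "No", "No", "No", "No"],
    "almond":  ["No", "No", "No", "No", "Yes", "No"],
})

ingredient_cols = toy.columns.drop("cuisine")

# 1 where the ingredient is present, 0 otherwise.
is_present = (toy[ingredient_cols] == "Yes").astype(int)

# Fraction of each cuisine's recipes that contain each ingredient.
by_cuisine_norm = is_present.groupby(toy["cuisine"]).mean().round(4)

print(by_cuisine_norm)
```

Dividing by the number of recipes per cuisine is what removes the bias toward cuisines with many recipes.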


Let's print out the profile for each cuisine by displaying the top four ingredients in each cuisine.


In [None]:
for(nation in by_cuisine_norm$cuisine){
    x <- sort(by_cuisine_norm[by_cuisine_norm$cuisine == nation,][-1], decreasing = TRUE)
    cat(c(toupper(nation)))
    cat("\n")
    cat(paste0(names(x)[1:4], " (", round(x[1:4]*100,0), "%) "))
    cat("\n")
    cat("\n")
}
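
A comparable report can be sketched in pandas with `nlargest`. The helper name `top_ingredients` is hypothetical and the fractions in `profile` are invented for illustration, not taken from the dataset.

```python
import pandas as pd

def top_ingredients(profile, n=4):
    """Return {cuisine: [top-n ingredient names]} from a cuisine-by-ingredient table."""
    return {
        cuisine: row.nlargest(n).index.tolist()
        for cuisine, row in profile.iterrows()
    }

# Hypothetical per-cuisine fractions (rows: cuisines, columns: ingredients).
profile = pd.DataFrame(
    {
        "egg":       [0.40, 0.10],
        "butter":    [0.38, 0.05],
        "rice":      [0.05, 0.60],
        "soy_sauce": [0.02, 0.55],
        "wheat":     [0.45, 0.20],
    },
    index=["canadian", "japanese"],
)

top_by_cuisine = top_ingredients(profile, n=4)
for cuisine, names in top_by_cuisine.items():
    # Print the cuisine in upper case, then its ingredients with percentages.
    print(cuisine.upper())
    print(" ".join(f"{name} ({profile.loc[cuisine, name]:.0%})" for name in names))
    print()
```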

At this point, we understand the data well, and it is clean and in the right format for modeling!

-----------
