<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>From Understanding to Preparation</font></h1>

## Introduction

In this lab, we will continue learning about the data science methodology, and focus on the **Data Understanding** and the **Data Preparation** stages.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    
1. [Recap](#0)<br>
2. [Data Understanding](#2)<br>
3. [Data Preparation](#4)<br>
</div>
<hr>

# Recap <a id="0"></a>

In Lab **From Requirements to Collection**, we learned that the data we need to answer the question developed in the business understanding stage, namely *can we automate the process of determining the cuisine of a given recipe?*, is readily available. A researcher named Yong-Yeol Ahn scraped tens of thousands of food recipes (cuisines and ingredients) from three different websites, namely:

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig1_allrecipes.png" width=500>

www.allrecipes.com

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig2_epicurious.png" width=500>

www.epicurious.com

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig3_menupan.png" width=500>

www.menupan.com

For more information on Yong-Yeol Ahn and his research, you can read his paper on [Flavor Network and the Principles of Food Pairing](http://yongyeol.com/papers/ahn-flavornet-2011.pdf).

We also collected the data and placed it on an IBM server for your convenience.

------------

# Data Understanding <a id="2"></a>

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig4_flowchart_data_understanding.png" width=500>

<strong>Important note:</strong> You are not expected to know how to program in R. The following code is meant to illustrate the stages of data understanding and data preparation, so it is fine if you do not understand the individual lines of code. We have a full course on programming in R, <a href="http://cocl.us/RP0101EN_DS0103EN_LAB3_R">R101</a>, so feel free to complete it if you are interested in learning how to program in R.

### Using this notebook:

To run any of the following code cells, select the cell and press **Shift + Enter** to execute it.

Get the version of R installed.

In [1]:
# check R version
R.Version()$version.string

Download the data from the IBM server.

In [2]:
# click here and press Shift + Enter
download.file("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/data/recipes.csv",
              destfile = "/resources/data/recipes.csv", quiet = TRUE)

recipes <- read.csv("/resources/data/recipes.csv") # takes 30 seconds

Show the first few rows.

In [3]:
head(recipes)

Unnamed: 0_level_0,country,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,⋯,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
Unnamed: 0_level_1,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
1,Vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
2,Vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
3,Vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
4,Vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
5,Vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
6,Vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No


Get the dimensions of the dataframe.

In [4]:
nrow(recipes)

In [5]:
ncol(recipes)

So our dataset consists of 57,691 recipes. Each row represents a recipe. The dataframe has 384 columns: the first documents the cuisine of the recipe, and the remaining 383, beginning with almond and ending with zucchini, record whether each ingredient is present in the recipe.
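Both dimensions can also be read in a single call with `dim()`. The following is a minimal sketch on a toy dataframe; the `toy` object and its values are made up for illustration and are not taken from the recipes data.

```r
# toy stand-in for the recipes dataframe (hypothetical values)
toy <- data.frame(
    country = c("Japanese", "Italian", "Japanese"),
    rice    = c("Yes", "No", "Yes"),
    wasabi  = c("Yes", "No", "No")
)

# dim() returns c(number of rows, number of columns)
toy_dim <- dim(toy)
toy_dim

# every column except the first one is an ingredient
n_ingredients <- ncol(toy) - 1
n_ingredients
```

Applied to `recipes`, the same subtraction gives the number of ingredient columns.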

We know that a basic sushi recipe includes the ingredients:
* rice
* soy sauce
* wasabi
* some fish/vegetables

Let's check that these ingredients exist in our dataframe:

In [6]:
grep("rice", names(recipes), value = TRUE) # yes as rice
grep("wasabi", names(recipes), value = TRUE) # yes
grep("soy", names(recipes), value = TRUE) # yes as soy_sauce

Yes, they do!

* rice exists as rice.
* wasabi exists as wasabi.
* soy exists as soy_sauce.

So if a recipe contains all three ingredients (rice, wasabi, and soy_sauce), we might be able to say with some confidence that it belongs to **Japanese** cuisine. Let's keep this in mind!
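Note that `grep()` matches substrings, so a pattern such as "rice" would also match any longer column name that contains it. When the exact names are known, `%in%` gives a stricter check. The sketch below uses a hypothetical vector of column names (`col_names`), not the real names of `recipes`.

```r
# hypothetical column names, for illustration only
col_names <- c("country", "rice", "brown_rice", "soy_sauce", "soybean", "wasabi")

# grep() matches substrings, so "rice" also finds "brown_rice"
partial <- grep("rice", col_names, value = TRUE)
partial

# %in% tests for exact names
wanted <- c("rice", "wasabi", "soy_sauce")
found <- wanted %in% col_names
names(found) <- wanted
found
```

With the real data, `col_names` would be replaced by `names(recipes)`.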

----------------

# Data Preparation <a id="4"></a>

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig5_flowchart_data_preparation.png" width=500>

In this section, we will prepare the data for the next stage in the data science methodology, which is modeling. This stage involves exploring the data further and making sure that it is in the right format for the machine learning algorithm that we selected in the analytic approach stage, which is decision trees.

First, look at the data to see if it needs cleaning.

In [7]:
base::table(recipes$country) # frequency table


                African                American                   asian 
                    115                   40150                      17 
                  Asian                 Austria              Bangladesh 
                   1176                      21                       4 
                Belgium            Cajun_Creole                  Canada 
                     11                     146                     774 
              Caribbean   Central_SouthAmerican                   China 
                    183                     241                     130 
                chinese                 Chinese              east_asian 
                     86                     226                     951 
           East-African          Eastern-Europe EasternEuropean_Russian 
                     11                     235                     146 
       English_Scottish                  France                  French 
                    204                     268   

By looking at the above table, we can make the following observations:

1. The cuisine column is labeled `country`, which is inaccurate.
2. Cuisine names are inconsistent: not all of them start with an uppercase letter (for example, asian and Asian).
3. Some cuisines appear more than once as variations of the country name, such as Vietnam and Vietnamese.
4. Some cuisines have very few recipes.
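The second observation can also be checked programmatically rather than by eye. The sketch below groups a hypothetical vector of labels (`labels`, invented to mimic the inconsistencies listed above) by lowercase form and reports the groups that have more than one spelling.

```r
# hypothetical labels that mimic the inconsistencies listed above
labels <- c("Asian", "asian", "China", "chinese", "Chinese", "French")

# group the distinct spellings by their lowercase form
variants <- split(unique(labels), tolower(unique(labels)))

# keep only the groups with more than one spelling
case_dupes <- variants[lengths(variants) > 1]
case_dupes
```

This only catches differences in letter case. Variations such as China and Chinese still have to be mapped by hand.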

#### Let's fix these problems.

Fix the name of the column showing the cuisine.

In [8]:
colnames(recipes)[1] <- "cuisine"

Make all the cuisine names lowercase.

In [9]:
recipes$cuisine <- tolower(as.character(recipes$cuisine))

recipes

cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,⋯,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
<chr>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No


Make the cuisine names consistent.

In [10]:
recipes$cuisine[recipes$cuisine == "austria"] <- "austrian"
recipes$cuisine[recipes$cuisine == "belgium"] <- "belgian"
recipes$cuisine[recipes$cuisine == "china"] <- "chinese"
recipes$cuisine[recipes$cuisine == "canada"] <- "canadian"
recipes$cuisine[recipes$cuisine == "netherlands"] <- "dutch"
recipes$cuisine[recipes$cuisine == "france"] <- "french"
recipes$cuisine[recipes$cuisine == "germany"] <- "german"
recipes$cuisine[recipes$cuisine == "india"] <- "indian"
recipes$cuisine[recipes$cuisine == "indonesia"] <- "indonesian"
recipes$cuisine[recipes$cuisine == "iran"] <- "iranian"
recipes$cuisine[recipes$cuisine == "israel"] <- "jewish"
recipes$cuisine[recipes$cuisine == "italy"] <- "italian"
recipes$cuisine[recipes$cuisine == "japan"] <- "japanese"
recipes$cuisine[recipes$cuisine == "korea"] <- "korean"
recipes$cuisine[recipes$cuisine == "lebanon"] <- "lebanese"
recipes$cuisine[recipes$cuisine == "malaysia"] <- "malaysian"
recipes$cuisine[recipes$cuisine == "mexico"] <- "mexican"
recipes$cuisine[recipes$cuisine == "pakistan"] <- "pakistani"
recipes$cuisine[recipes$cuisine == "philippines"] <- "philippine"
recipes$cuisine[recipes$cuisine == "scandinavia"] <- "scandinavian"
recipes$cuisine[recipes$cuisine == "spain"] <- "spanish_portuguese"
recipes$cuisine[recipes$cuisine == "portugal"] <- "spanish_portuguese"
recipes$cuisine[recipes$cuisine == "switzerland"] <- "swiss"
recipes$cuisine[recipes$cuisine == "thailand"] <- "thai"
recipes$cuisine[recipes$cuisine == "turkey"] <- "turkish"
recipes$cuisine[recipes$cuisine == "irish"] <- "uk-and-irish"
recipes$cuisine[recipes$cuisine == "uk-and-ireland"] <- "uk-and-irish"
recipes$cuisine[recipes$cuisine == "vietnam"] <- "vietnamese"

recipes

cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,⋯,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
<chr>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
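The long run of assignments above could also be written with a lookup table. The following is a sketch of that alternative, not the method used in this lab. It copies four of the mappings from the cell above into a named vector (`country_to_cuisine`, a hypothetical name) and applies them to a small made-up `cuisine` vector.

```r
# a few of the mappings from the cell above, stored as a named vector
# (names are the old labels, values are the new ones)
country_to_cuisine <- c(
    austria = "austrian",
    china   = "chinese",
    spain   = "spanish_portuguese",
    vietnam = "vietnamese"
)

# hypothetical cuisine column, already lowercased
cuisine <- c("china", "chinese", "vietnam", "italian", "spain")

# replace only the labels that appear in the lookup table
to_fix <- cuisine %in% names(country_to_cuisine)
cuisine[to_fix] <- unname(country_to_cuisine[cuisine[to_fix]])
cuisine
```

Keeping the mappings in one vector makes them easier to review and extend than separate assignment lines.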


Remove cuisines with < 50 recipes:

In [11]:
# sort the cuisine frequency table in descending order
t <- sort(base::table(recipes$cuisine), decreasing = TRUE)

t


               american                 italian                 mexican 
                  40150                    3250                    2390 
                 french                   asian              east_asian 
                   1264                    1193                     951 
                 korean                canadian                  indian 
                    799                     774                     598 
                western                 chinese      spanish_portuguese 
                    450                     442                     416 
           uk-and-irish       southern_soulfood                  jewish 
                    368                     346                     329 
               japanese                  german           mediterranean 
                    320                     289                     289 
                   thai            scandinavian           middleeastern 
                    289                     250   

In [12]:
# get cuisines with >= 50 recipes
filter_list <- names(t[t >= 50])

filter_list

In [13]:
before <- nrow(recipes) # number of rows of original dataframe
print(paste0("Number of rows of original dataframe is ", before))

recipes <- recipes[recipes$cuisine %in% filter_list,]

after <- nrow(recipes)
print(paste0("Number of rows of processed dataframe is ", after))

print(paste0(before - after, " rows removed!"))

[1] "Number of rows of original dataframe is 57691"
[1] "Number of rows of processed dataframe is 57403"
[1] "288 rows removed!"
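The same filtering pattern can be tried on a small example. This sketch uses a made-up dataframe (`toy`) and a threshold of 2 instead of 50 so that the effect is easy to see.

```r
# hypothetical cuisine labels: "thai" appears 3 times, "greek" once
toy <- data.frame(
    cuisine = c("thai", "thai", "thai", "greek", "korean", "korean"),
    stringsAsFactors = FALSE
)

min_recipes <- 2  # the lab uses 50; 2 keeps the toy example small

# count recipes per cuisine and keep the names that meet the threshold
counts <- table(toy$cuisine)
keep <- names(counts[counts >= min_recipes])

# drop = FALSE keeps the result a dataframe even with a single column
filtered <- toy[toy$cuisine %in% keep, , drop = FALSE]
rows_removed <- nrow(toy) - nrow(filtered)
rows_removed
```

Only the single greek recipe falls below the threshold here, so one row is removed.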


Convert all of the columns into factors, so that the classification model we run later can treat each column as a categorical variable.

In [14]:
recipes[,names(recipes)] <- lapply(recipes[,names(recipes)], as.factor)

recipes

Unnamed: 0_level_0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,⋯,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
Unnamed: 0_level_1,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
1,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
2,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
3,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
4,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
5,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
6,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
7,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
8,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
9,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
10,vietnamese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
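It can be useful to confirm that the conversion worked and to see how the factor levels are coded, because later cells count "Yes" values through the integer codes. The sketch below uses a made-up dataframe (`toy`) and assumes the ingredient columns only contain "No" and "Yes".

```r
# hypothetical dataframe with character columns
toy <- data.frame(
    cuisine = c("thai", "greek", "thai"),
    rice    = c("Yes", "No", "Yes"),
    stringsAsFactors = FALSE
)

# convert every column to a factor; toy[] keeps the dataframe structure
toy[] <- lapply(toy, as.factor)

all_factors <- all(sapply(toy, is.factor))
all_factors

# levels are sorted alphabetically, so "No" is code 1 and "Yes" is code 2
rice_levels <- levels(toy$rice)
rice_codes <- as.integer(toy$rice)
rice_levels
rice_codes
```

If a column happened to contain only "Yes", that value would become code 1, so comparing against the label itself is the more robust test.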


In R, you can check the structure of your data using the **str** function. Let's check the structure of our dataframe **recipes**.

In [15]:
str(recipes)

'data.frame':	57403 obs. of  384 variables:
 $ cuisine                : Factor w/ 34 levels "african","american",..: 33 33 33 33 33 33 33 33 33 33 ...
 $ almond                 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ angelica               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ anise                  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ anise_seed             : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ apple                  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ apple_brandy           : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ apricot                : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ armagnac               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ artemisia              : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ artichoke              : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ asparagus       

#### Let's analyze the data a little more in order to learn the data better and note any interesting preliminary observations.

Run the following cell to get the recipes that contain **rice** *and* **soy_sauce** *and* **wasabi** *and* **seaweed**.

In [16]:
check_recipes <- recipes[
    recipes$rice == "Yes" &
    recipes$soy_sauce == "Yes" &
    recipes$wasabi == "Yes" &
    recipes$seaweed == "Yes",
]

check_recipes

Unnamed: 0_level_0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,⋯,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
Unnamed: 0_level_1,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
11307,japanese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
11322,japanese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,Yes,No,No,No,No,No
11362,japanese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
12172,asian,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,Yes,No,No,No,No,No
12386,asian,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
13011,asian,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
13160,asian,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
13514,japanese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
13587,japanese,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No
13626,east_asian,No,No,No,No,No,No,No,No,No,⋯,No,No,No,No,No,No,No,No,No,No


Based on the results of the above code, can we classify all recipes that contain **rice** *and* **soy_sauce** *and* **wasabi** *and* **seaweed** as **Japanese** recipes? Why?
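One way to approach this question is to tabulate the cuisines of the matching recipes. The sketch below uses a made-up stand-in for `check_recipes` (`check_toy`), and its values are illustrative only.

```r
# hypothetical subset of recipes that matched all four ingredients
check_toy <- data.frame(
    cuisine = c("japanese", "japanese", "asian", "asian", "east_asian", "japanese"),
    stringsAsFactors = FALSE
)

# count and share of each cuisine among the matches
match_counts <- table(check_toy$cuisine)
match_share <- round(prop.table(match_counts), 2)
match_share
```

If a noticeable share of the matches carries a label other than japanese, the four ingredients alone are not enough to classify a recipe.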

Let's count the ingredients across all recipes.

In [17]:
# for each column, count the rows whose value is "Yes" (the second factor level, i.e. integer code 2)
ingred <- unlist(
            lapply( recipes[, names(recipes)], function(x) sum(as.integer(x) == 2))
            )

# transpose the dataframe so that each row is an ingredient
ingred <- as.data.frame( t( as.data.frame(ingred) ))
                
ing_df <- data.frame("ingredient" = names(ingred), 
                     "count" = as.numeric(ingred[1,])
                    )[-1,]
                
ing_df

| | ingredient | count |
|---|---|---|
| | `<fct>` | `<dbl>` |
| 2 | almond | 2306 |
| 3 | angelica | 1 |
| 4 | anise | 223 |
| 5 | anise_seed | 87 |
| 6 | apple | 2422 |
| 7 | apple_brandy | 37 |
| 8 | apricot | 620 |
| 9 | armagnac | 11 |
| 10 | artemisia | 13 |
| 11 | artichoke | 391 |

Now we have a dataframe of ingredients and their total counts across all recipes. Let's sort this dataframe in descending order.

In [18]:
ing_df_sort <- ing_df[order(ing_df$count, decreasing = TRUE),]
rownames(ing_df_sort) <- 1:nrow(ing_df_sort)

ing_df_sort

| | ingredient | count |
|---|---|---|
| | `<fct>` | `<dbl>` |
| 1 | egg | 21025 |
| 2 | wheat | 20781 |
| 3 | butter | 20719 |
| 4 | onion | 18080 |
| 5 | garlic | 17353 |
| 6 | milk | 12870 |
| 7 | vegetable_oil | 11105 |
| 8 | cream | 10171 |
| 9 | tomato | 9920 |
| 10 | olive_oil | 9876 |

#### What are the 3 most popular ingredients?

However, note that there is a problem with the above table. There are ~40,000 American recipes in our dataset (40,150 of 57,403), so the overall counts are dominated by the ingredients of American recipes.

**Therefore**, let's compute a more objective summary of the ingredients by looking at the ingredients per cuisine.

#### Let's create a *profile* for each cuisine.

In other words, let's find out which ingredients are typical of each cuisine: for example, which ingredients **Chinese** recipes typically use, or what characterizes **Canadian** food.

In [19]:
# create a dataframe of the counts of ingredients by cuisine, normalized by the number of 
# recipes pertaining to that cuisine
by_cuisine_norm <- aggregate(recipes, 
                        by = list(recipes$cuisine), 
                        FUN = function(x) round(sum(as.integer(x) == 2)/
                                                length(as.integer(x)),4))
# remove the unnecessary column "cuisine"
by_cuisine_norm <- by_cuisine_norm[,-2]

# rename the first column into "cuisine"
names(by_cuisine_norm)[1] <- "cuisine"
                            
head(by_cuisine_norm)

Unnamed: 0_level_0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,⋯,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
Unnamed: 0_level_1,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,african,0.1565,0,0.0,0.0,0.0348,0.0,0.0696,0.0,0,⋯,0.0,0.0087,0.0435,0.0087,0.0174,0.0,0.0087,0.0174,0.0,0.0348
2,american,0.0406,0,0.003,0.0006,0.0521,0.0006,0.0113,0.0001,0,⋯,0.003,0.0069,0.0308,0.0148,0.011,0.0007,0.0014,0.0682,0.0169,0.0186
3,asian,0.0075,0,0.0008,0.0025,0.0126,0.0,0.005,0.0,0,⋯,0.0008,0.0017,0.0386,0.0017,0.1249,0.0,0.0017,0.0042,0.0109,0.0117
4,cajun_creole,0.0,0,0.0,0.0,0.0068,0.0,0.0,0.0,0,⋯,0.0,0.0068,0.0822,0.0,0.1918,0.0,0.0068,0.0342,0.0068,0.0
5,canadian,0.0362,0,0.0,0.0,0.0362,0.0,0.0026,0.0,0,⋯,0.0026,0.0039,0.0297,0.0207,0.0039,0.0,0.0013,0.0672,0.0194,0.0116
6,caribbean,0.0164,0,0.0109,0.0,0.0109,0.0,0.0,0.0,0,⋯,0.0,0.0,0.0601,0.0055,0.0,0.0,0.0,0.0273,0.0109,0.0164


As shown above, we have just created a dataframe where each row is a cuisine and each column (except for the first column) is an ingredient. Each value is the fraction of that cuisine's recipes that contain the ingredient, so multiplying it by 100 gives a percentage.

**For example**:

* *almond* is present in 15.65% of the **African** recipes.
* *butter* is present in 38.11% of the **Canadian** recipes.
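To see how such a normalized profile comes about, the computation can be traced on a small example. The sketch below uses a made-up dataframe (`toy`) and relies on the fact that the mean of a logical vector is the fraction of `TRUE` values.

```r
# hypothetical recipes: 2 thai and 4 greek
toy <- data.frame(
    cuisine = c("thai", "thai", "greek", "greek", "greek", "greek"),
    rice    = c("Yes", "Yes", "No", "No", "Yes", "No"),
    garlic  = c("Yes", "No", "Yes", "Yes", "No", "No"),
    stringsAsFactors = FALSE
)

# mean() of a logical vector is the fraction of TRUE values,
# i.e. the share of recipes in the group that contain the ingredient
profile <- aggregate(toy[-1] == "Yes",
                     by = list(cuisine = toy$cuisine),
                     FUN = mean)
profile
```

Because each cuisine is divided by its own number of recipes, a cuisine with few recipes is described on the same scale as one with many.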

Let's print out the profile for each cuisine by displaying the top four ingredients in each cuisine.

In [20]:
for(nation in by_cuisine_norm$cuisine){
    x <- sort(by_cuisine_norm[by_cuisine_norm$cuisine == nation,][-1], decreasing = TRUE)
    cat(c(toupper(nation)))
    cat("\n")
    cat(paste0(names(x)[1:4], " (", round(x[1:4]*100,0), "%) "))
    cat("\n")
    cat("\n")
}

AFRICAN
onion (53%)  olive_oil (52%)  garlic (50%)  cumin (43%) 

AMERICAN
butter (41%)  egg (41%)  wheat (40%)  onion (29%) 

ASIAN
soy_sauce (50%)  ginger (49%)  garlic (48%)  rice (41%) 

CAJUN_CREOLE
onion (70%)  cayenne (56%)  garlic (49%)  butter (36%) 

CANADIAN
wheat (40%)  butter (38%)  egg (35%)  onion (34%) 

CARIBBEAN
onion (51%)  garlic (51%)  black_pepper (31%)  vegetable_oil (31%) 

CENTRAL_SOUTHAMERICAN
garlic (57%)  onion (54%)  cayenne (52%)  tomato (41%) 

CHINESE
soy_sauce (69%)  ginger (53%)  garlic (53%)  scallion (48%) 

EAST_ASIAN
garlic (55%)  soy_sauce (50%)  scallion (50%)  cayenne (48%) 

EASTERN-EUROPE
wheat (53%)  egg (52%)  butter (48%)  onion (45%) 

EASTERNEUROPEAN_RUSSIAN
butter (60%)  egg (51%)  wheat (49%)  onion (38%) 

ENGLISH_SCOTTISH
butter (67%)  wheat (62%)  egg (53%)  cream (41%) 

FRENCH
butter (50%)  egg (44%)  wheat (37%)  olive_oil (28%) 

GERMAN
wheat (65%)  egg (61%)  butter (47%)  onion (35%) 

GREEK
olive_oil (76%)  garlic (44%)  onion
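
The loop above boils down to sorting each cuisine's row and keeping the leading names. The following is a minimal sketch of that step on a made-up profile table (`profile`, with invented shares), collecting the result in a list instead of printing it.

```r
# hypothetical profile: one row per cuisine, one column per ingredient
profile <- data.frame(
    cuisine   = c("greek", "thai"),
    olive_oil = c(0.76, 0.02),
    garlic    = c(0.44, 0.60),
    rice      = c(0.10, 0.55),
    stringsAsFactors = FALSE
)

top_n <- 2  # the lab prints the top four

# turn the ingredient columns into a numeric matrix named by cuisine
shares <- as.matrix(profile[-1])
rownames(shares) <- profile$cuisine

# for each cuisine, sort its row in decreasing order and keep the first top_n names
top_ingredients <- lapply(rownames(shares), function(cu) {
    names(sort(shares[cu, ], decreasing = TRUE))[seq_len(top_n)]
})
names(top_ingredients) <- rownames(shares)
top_ingredients
```

Storing the result in a list makes it possible to reuse the top ingredients later rather than only displaying them.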

At this point, we have a good understanding of the data, and the data is in the right format for modeling!

-----------

### Thank you for completing this lab!

This notebook was created by [Polong Lin](https://ca.linkedin.com/in/polonglin) and revised by [Alex Aklson](https://www.linkedin.com/in/aklson/). We hope you found this lab session interesting. Feel free to contact us if you have any questions!

This notebook is part of the free course on **Cognitive Class** called *Data Science Methodology*. If you accessed this notebook outside the course, you can take this free, self-paced online course by clicking [here](http://cocl.us/DS0103EN_LAB3_R).

<hr>

Copyright &copy; 2019 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).