### Run this to get the data now

In [None]:
#Click here and press Shift+Enter
download.file("https://ibm.box.com/shared/static/5wah9atr5o1akuuavl2z9tkjzdinr1lv.csv",
              destfile = "/resources/data/recipes.csv", quiet = TRUE)

## Install libraries if not installed
if (!("rpart" %in% rownames(installed.packages()))) {
    install.packages("rpart", repos = "http://mirror.las.iastate.edu/CRAN/")
}
if (!("rpart.plot" %in% rownames(installed.packages()))) {
    install.packages("rpart.plot", repos = "http://mirror.las.iastate.edu/CRAN/")
}

print("Done") #Takes about 30 seconds

<hr>

<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/wbqvbi6o6ip0vz55ua5gp17g4f1k7ve9.png" width = 400> </a>


<h1 align=center> Data Science Methodology</h1>
<h1 align=center> With Decision Trees and Clustering</h1>
<h4 align=center><a href = "https://ca.linkedin.com/in/polonglin">Polong Lin</a></h4>


### Table of contents:

1. Import `recipes.csv` into R.
2. Data understanding & data preparation
    - what does the data look like?
        - look at the data
    - what can we tell about the data?
        - summarize, visualize it
    - is the data clean? if not?
        - clean the data (i.e., inconsistent country names)
    - which cuisines are most similar to each other? (k-means clustering)

<hr>


### Using this notebook:

**Shift + Enter** to run a cell:

In [None]:
# Check R version
R.Version()$version.string

<hr>

## 1. Import recipes.csv into R.

**Note**: If you'd like to download the data to your own computer (optional), you can access the file here: 
> **`recipes.csv`** (64.2 MB)  
> http://bit.ly/recipesdata

Run the cell using Shift + Enter

In [None]:
recipes <- read.csv("/resources/data/recipes.csv") #takes 10 sec

**Tip**: To create a new code cell, in the top menu, go to *Insert* -> *Insert Cell Below*.

<hr>

<h1 align = center> 2. Data understanding & preparation </h1>
<br>
<br>
<img src = https://ibm.box.com/shared/static/ctv4qau0q7ny0af4jp8mwi50l8fehmsz. width = 600>

### Show the first few rows

In [None]:
head(recipes)

#### How many rows, columns in total?

In [None]:
nrow(recipes)

In [None]:
ncol(recipes)

### The ingredients in sushi

To make sushi you need:
- rice
- soy sauce
- wasabi
- some fish/vegetables

Let's check that these ingredients exist in our dataframe:

In [None]:
grep("rice", names(recipes), value = TRUE) #Yes as rice
grep("wasabi", names(recipes), value = TRUE) #Yes
grep("soy", names(recipes), value = TRUE) #Yes as soy_sauce

Yes, they do!
So maybe... if a recipe contains all three (rice, wasabi, soy_sauce), then it might be for sushi, which might make it Japanese! Let's keep this in mind!
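One thing to keep in mind about the `grep()` checks above: the pattern is a regular expression that can match anywhere inside a name, so `"rice"` also matches any column whose name merely contains those letters. Here is a minimal sketch, using made-up column names rather than the real dataset, of how an anchored pattern or `%in%` limits the check to exact names:

```r
# Hypothetical column names, for illustration only (not taken from recipes.csv)
cols <- c("country", "rice", "brown_rice", "licorice", "soy_sauce", "soybean")

# Unanchored: matches every name that contains "rice"
loose <- grep("rice", cols, value = TRUE)
print(loose)

# Anchored with ^ and $: matches only the name that is exactly "rice"
exact <- grep("^rice$", cols, value = TRUE)
print(exact)

# %in% checks exact membership without any regular expression
has_all <- all(c("rice", "soy_sauce") %in% cols)
print(has_all)
```

The loose search is still handy for exploring, since it reveals related columns you might not have guessed.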

<br>

#### Okay, let's look at the data
First, look at the data to see if it needs cleaning:

In [None]:
base::table(recipes$country) #frequency table

#### Let's sort the table: Which countries have the most recipes in this dataset?

In [None]:
country_counts <- base::table(recipes$country)  # notice any data quality problems?
sort(country_counts, decreasing = TRUE)

<br> 

### Problem: Inconsistent country names (by case and by name)

**Goals:**
- convert all country names to _lowercase_
- make country names consistent
- convert all ingredient columns into factors (for classification)

Make all the country names **lowercase**:

In [None]:
#Run this
recipes$country <- tolower( as.character(recipes$country) ) 
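To see why lowercasing matters, here is a small sketch with made-up labels (not the real data). Labels that differ only by case are counted as separate groups until they are lowercased, while labels that differ by name stay separate and need their own fix:

```r
# Made-up country labels, for illustration only
country <- c("American", "american", "Italy", "Italian", "American", "italian")

# Before: "American" and "american" are counted separately
counts <- sort(table(country), decreasing = TRUE)
print(counts)

# After lowercasing, labels that differ only by case collapse into one group
counts_lower <- sort(table(tolower(country)), decreasing = TRUE)
print(counts_lower)
```

Note that `"italy"` and `"italian"` are still two groups after lowercasing.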

Make the country names **consistent**:

In [None]:
#Run this
recipes$country[recipes$country == "china"] <- "chinese"
recipes$country[recipes$country == "france"] <- "french"
recipes$country[recipes$country == "germany"] <- "german"
recipes$country[recipes$country == "india"] <- "indian"
recipes$country[recipes$country == "israel"] <- "jewish"
recipes$country[recipes$country == "italy"] <- "italian"
recipes$country[recipes$country == "japan"] <- "japanese"
recipes$country[recipes$country == "korea"] <- "korean"
recipes$country[recipes$country == "mexico"] <- "mexican"
recipes$country[recipes$country == "scandinavia"] <- "scandinavian"
recipes$country[recipes$country == "thailand"] <- "thai"
recipes$country[recipes$country == "vietnam"] <- "vietnamese"
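The one-line-per-country replacements above work fine. As an alternative sketch, the same recoding can be driven by a named lookup vector, which keeps the mapping in one place. The labels and the `lookup` vector below are made up for illustration:

```r
# Made-up labels, for illustration only
country <- c("china", "chinese", "italy", "american", "japan")

# Hypothetical lookup: names are the labels to replace, values are the replacements
lookup <- c(china = "chinese", italy = "italian", japan = "japanese")

# Replace only the labels that appear in the lookup; leave the rest untouched
needs_fix <- country %in% names(lookup)
country[needs_fix] <- unname(lookup[country[needs_fix]])
print(country)
```

Adding another country then means adding one entry to `lookup` rather than another assignment line.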

### Problem: Some countries have very few recipes

Remove data for countries with fewer than 50 recipes:

In [None]:
country_counts <- sort(base::table(recipes$country), decreasing = TRUE)

In [None]:
filter_list <- names( country_counts[ country_counts >= 50 ] )

before <- nrow(recipes) #number of rows of original df

recipes <- recipes[recipes$country %in% filter_list, ]

after <- nrow(recipes)

print(paste(before - after, "rows removed."))

recipes$country <- as.factor(as.character(recipes$country))

sort(base::table(recipes$country), decreasing = T)
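In this notebook the country column is a character vector when the rows are filtered, so converting it to a factor afterwards yields only the remaining countries. If the column were already a factor, the filtered-out countries would linger as empty levels. The sketch below, on made-up data with a hypothetical threshold of 2, shows the same filter and how `droplevels()` handles that case:

```r
# Made-up data, for illustration only
df <- data.frame(country = c("a", "a", "a", "b", "b", "c"),
                 stringsAsFactors = TRUE)
min_recipes <- 2  # hypothetical threshold (the notebook uses 50)

# Keep only the countries that meet the threshold
counts <- table(df$country)
keep <- names(counts[counts >= min_recipes])
df_kept <- df[df$country %in% keep, , drop = FALSE]

# Filtering rows does not remove unused factor levels; droplevels() does
levels_before <- levels(df_kept$country)
df_kept$country <- droplevels(df_kept$country)
levels_after <- levels(df_kept$country)

print(levels_before)
print(levels_after)
```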

#### Convert all of the columns into factors (to run the classification model later)

In [None]:
#Run this
recipes[,names(recipes)] <- lapply(recipes[,names(recipes)] , as.factor)
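It helps to know what a factor stores, because later cells compare `as.integer(x)` to `2`. A factor keeps its levels in sorted order and stores each value as an integer code, so with the two levels `"No"` and `"Yes"`, `"Yes"` gets code 2. A minimal sketch on a made-up column, assuming both values occur in it:

```r
# Made-up ingredient column, for illustration only
rice <- as.factor(c("Yes", "No", "No", "Yes", "Yes"))

print(levels(rice))          # levels are stored in sorted order
codes <- as.integer(rice)    # integer code of each value (1 = first level)
print(codes)

# Counting by label gives the same answer and does not depend on level order
n_yes_by_code  <- sum(codes == 2)
n_yes_by_label <- sum(rice == "Yes")
```

If a column happened to contain only `"Yes"`, that value would get code 1 instead, so comparing against the label is the safer habit.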

### Check the data:

**YOUR TURN:**
- check the **structure** of your data
- to do so, use the following code on **`recipes`**:
> `str(recipes)`

In [None]:
### TYPE YOUR CODE BELOW then press Shift+Enter to run it ###




## Can we tell that some food is Japanese if it contains rice _and_ soy sauce _and_ wasabi _and_ seaweed?

In [None]:
checkjapan <- recipes[recipes$rice == "Yes" &
                  recipes$soy_sauce == "Yes" &
                  recipes$wasabi == "Yes" &
                  recipes$seaweed == "Yes",]
checkjapan
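Reading the matching rows by eye gets unwieldy once there are more than a handful. One way to answer the question directly is to tabulate the country column of the matches. This is a sketch on a made-up table with only two ingredient columns, not the real `recipes` data:

```r
# Made-up recipes, for illustration only
recipes_toy <- data.frame(
  country = c("japanese", "japanese", "american", "korean"),
  rice    = c("Yes", "Yes", "Yes", "Yes"),
  wasabi  = c("Yes", "No", "Yes", "No"),
  stringsAsFactors = TRUE
)

# Rows that contain every ingredient of interest
match_rows <- recipes_toy[recipes_toy$rice == "Yes" &
                          recipes_toy$wasabi == "Yes", ]

# Share of the matching recipes that belong to each country
share <- prop.table(table(droplevels(match_rows$country)))
print(share)
```

If the share for one country is close to 1, the ingredient combination is a strong hint; if it is spread out, the rule is too loose.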

<br>

### Q: Which ingredients are most common? Which are the least-used ingredients?

**Goals:**
- count the ingredients across all recipes

Go ahead and run the cell below:

In [None]:
# Run this

## For each column, count the rows whose value is "Yes" (integer code 2 of the factor)
ingred <- unlist(
            lapply( recipes[, names(recipes)], function(x) sum(as.integer(x) == 2))
            )

## Transpose into a single row with one column per ingredient
ingred <- as.data.frame( t( as.data.frame(ingred) ))

## Build one row per ingredient, then drop the first row (the "country" column)
ing_df <- data.frame("ingredient" = names(ingred), 
                     "count" = as.numeric(ingred[1,])
                    )[-1,]

Now we have a dataframe of ingredients and their total counts across all recipes. This dataframe needs to be **sorted**.
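For comparison, here is a shorter sketch of the same counting idea. It is an alternative, not the code this notebook runs: compare the ingredient columns to the label `"Yes"` and sum each column with `colSums()`. The table below is made up for illustration:

```r
# Made-up recipes, for illustration only
recipes_toy <- data.frame(
  country = c("japanese", "american", "american"),
  rice    = c("Yes", "No", "Yes"),
  butter  = c("No", "Yes", "No"),
  stringsAsFactors = TRUE
)

# Leave out the label column so only ingredients are counted
ingredient_cols <- setdiff(names(recipes_toy), "country")

# TRUE counts as 1, so colSums() gives the number of "Yes" rows per column
counts <- colSums(recipes_toy[, ingredient_cols, drop = FALSE] == "Yes")

ing_toy <- data.frame(ingredient = names(counts), count = as.numeric(counts))
print(ing_toy)
```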


**Which ingredients are most popular?**

In [None]:
# Run this to sort the df
ing_df_sort <- ing_df[order(ing_df$count, decreasing = TRUE),]
rownames(ing_df_sort) <- 1:nrow(ing_df_sort)
ing_df_sort

Did you notice a problem with the table above? The ingredient counts are taken across all the recipes, but most of the recipes are American. This means that the ranking is biased towards ingredients that are common in American recipes.

#### But our list was across _all_ recipes. What about the ingredients used per country?

<hr>

### Q: How does the distribution of ingredients differ between countries?

What is the ingredient "profile" of each country?  
- What ingredients do Chinese people typically use?  
- What _is_ "Canadian food" anyway?  

**Goals:**
- Find counts of ingredients by country, normalized by the number of recipes in that country
    - have one row for each country, one column for each ingredient
    - for each country, for each ingredient, show the proportion of recipes (in that country) that contain that ingredient
    - make it into a dataframe
- Find top-used ingredients by country

Go ahead and run the cell below.

In [None]:
by_country_norm <- aggregate(recipes, 
                        by = list(recipes$country), 
                        FUN = function(x) round(sum(as.integer(x) == 2)/
                                                length(as.integer(x)),4))
#Remove the unnecessary column "country"
by_country_norm <- by_country_norm[,-2]

#Rename the first column into "country"
names(by_country_norm)[1] <- "country"

We have just created a dataframe where each row is a country, each column is an ingredient, and each value is the proportion of that country's recipes that contain the ingredient (a ratio between 0 and 1).
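To make the normalization concrete, here is a minimal sketch on a made-up table. It takes a slightly different route from the cell above: it compares to the label `"Yes"` and uses `mean()` as the aggregation function, since the mean of a logical column is the proportion of `TRUE` values.

```r
# Made-up recipes, for illustration only
recipes_toy <- data.frame(
  country = c("a", "a", "b", "b", "b", "b"),
  rice    = c("Yes", "No", "Yes", "Yes", "Yes", "No"),
  butter  = c("No", "No", "Yes", "No", "No", "No"),
  stringsAsFactors = TRUE
)

# One row per country, one column per ingredient, values are proportions
by_country_toy <- aggregate(
  recipes_toy[, c("rice", "butter")] == "Yes",
  by  = list(country = recipes_toy$country),
  FUN = mean
)
print(by_country_toy)
```

Country `a` has 2 recipes and 1 uses rice, so its rice value is 0.5; country `b` has 4 recipes and 3 use rice, giving 0.75.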

Let's take a look at the dataframe.

**YOUR TURN!**
- use **`head(df)`** to show the first 6 rows of a dataframe `df`
- dataframe: `by_country_norm`

_E.g., "almond" is present across 15.65% of all of the "african" recipes_

#### TYPE YOUR CODE BELOW ###
> head(by_country_norm)

In [None]:
## YOUR CODE HERE



### What is "Canadian food"?
> `region <- "canada"`  
> `regiondata <- by_country_norm[by_country_norm$country == region,]`  
> `t(sort(regiondata, decreasing = TRUE))`

In [None]:
region <- "canada" #select a country
regiondata <- by_country_norm[by_country_norm$country == region,]
t(sort(regiondata, decreasing = TRUE))


#### Challenge: 
Come up with a way to visualize this data in an interesting way. Tweet us [@bigdatau](https://twitter.com/intent/tweet?ref_src=twsrc%5Etfw&amp;text=%23rstats%20%23datascience%20%40BigDataU)!
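As one possible starting point for the challenge, here is a minimal base-graphics sketch. The region row is made up for illustration; with the real data you would pass one row of `by_country_norm` instead.

```r
# Made-up ingredient shares for one region, for illustration only
region_row <- data.frame(country = "a", rice = 0.10, butter = 0.60, egg = 0.40)

# Drop the label column and flatten the single row into a named numeric vector
shares <- sort(unlist(region_row[, -1]), decreasing = TRUE)

# Horizontal bars; rev() puts the largest share at the top of the chart
bar_mids <- barplot(rev(shares), horiz = TRUE, las = 1,
                    xlab = "Share of recipes",
                    main = "Ingredient profile (toy data)")
```

With hundreds of ingredients you would probably plot only the first ten or so entries of `shares`.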

### Q: What are the top five most popular ingredients in each country?

**Goals:**
- for each country, return the top-5 most prevalent ingredients

In [None]:
for(nation in by_country_norm$country){
    x <- sort(by_country_norm[by_country_norm$country == nation,][-1], decreasing = TRUE)
    cat(c(toupper(nation)))
    cat("\n")
    cat(paste0(names(x)[1:5], " (", round(x[1:5]*100,0), "%) "))
    cat("\n")
    cat("\n")
}
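The same idea can be sketched without printing inside a loop, by collecting the top ingredients of each country in a list. The table and the value of `top_n` below are made up for illustration:

```r
# Made-up normalized table, for illustration only
by_country_toy <- data.frame(
  country = c("a", "b"),
  rice    = c(0.50, 0.10),
  butter  = c(0.20, 0.60),
  egg     = c(0.30, 0.40)
)
top_n <- 2  # hypothetical number of ingredients to report

top_by_country <- lapply(seq_len(nrow(by_country_toy)), function(i) {
  shares <- unlist(by_country_toy[i, -1])        # named numeric vector
  head(sort(shares, decreasing = TRUE), top_n)   # largest shares first
})
names(top_by_country) <- by_country_toy$country
print(top_by_country)
```

Keeping the result in a list means it can be reused later, for example to compare countries, instead of only being shown once.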

<hr>

<h2 align = center>Which cuisines are most similar to each other?</h2>


Set number of clusters:

In [1]:
n_cluster = 9 #you can change this value (e.g., to 3 clusters)

Run k-means clustering:

In [None]:
df_cluster <- by_country_norm
k <- kmeans(df_cluster[,-1], n_cluster)
df_cluster$cluster <- k$cluster
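Keep in mind that `kmeans()` starts from randomly chosen centers, so the cluster numbers, and sometimes the groupings themselves, can change from run to run. Here is a minimal sketch on made-up data of two common ways to make the result steadier: fixing the random seed and asking for several random starts with `nstart`. The seed value and `nstart = 25` are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, chosen only to make the run repeatable

# Made-up ingredient shares for six cuisines forming two obvious groups
shares <- rbind(
  a = c(0.90, 0.10), b = c(0.85, 0.15), c = c(0.80, 0.05),
  d = c(0.10, 0.90), e = c(0.15, 0.80), f = c(0.05, 0.85)
)
colnames(shares) <- c("rice", "butter")

# nstart = 25 tries 25 random starts and keeps the best one
fit <- kmeans(shares, centers = 2, nstart = 25)

# List the cuisines that ended up in each cluster
print(split(rownames(shares), fit$cluster))
```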

Print results:

In [None]:
for(i in seq( 1, n_cluster )){
    i <- as.character(i)
    cat(paste0("[Cluster ", i, "]----------------------------------------------------------"))
    cat("\n")
        
    print(paste0(as.character(df_cluster[df_cluster$cluster == i,]$country)))
    cat("\n")
    }

<hr>

**References**  
Recipes dataset adapted from: 
- [Ahn, Yong-Yeol, et al. "Flavor network and the principles of food pairing." Scientific reports 1 (2011).](http://yongyeol.com/papers/ahn-flavornet-2011.pdf)
- ^ Dataset on ingredient-compounds also available