Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The first three text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: Computing and Visualizing Distance Matrices using Fruit Data

**Name:** Eric Hare

**Email address associated with your DataCamp account:** ericrhare@gmail.com

**GitHub username:** erichare

**Project description**: One of the most important unsupervised learning methods is *cluster analysis*, in which related groups of observations are derived in spite of the lack of a response variable. The key mathematical concept here is the idea of *distance*, to quantify how different (or similar) two observations are to one another. In this project, you will explore the computation of distances using several methods, including those designed for numeric, categorical, and mixed data.

Basic knowledge of, and comfort with, R programming is a must, and familiarity with matrix operations is a plus. We will be using R along with several companion packages: cluster, dplyr, tidyr, and ggplot2. The primary functions we will be working with are `dist()` from base R and `daisy()` from the cluster package. Comfort with the tidyverse packages and the pipe operator is also a plus.

The data used is a dataset of fruit characteristics available at: https://github.com/OAITI/open-datasets/tree/master/Food%20Data/Fruits . The data includes physical properties of the fruit, as well as nutritional content in a particular serving size of that fruit.

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Exploring the fruit data

Exploring the physical and nutritional properties of fruits matters in agriculture and in our everyday lives! Some fruits are more nutritious, some more caloric, and they vary widely in size, shape, and color. We will get our feet wet by exploring some of the data before diving into distance calculations.

![](img/watermelon.png)
![](img/orange.png)

Let's explore and compare **Watermelons** and **Oranges** in the data:

In [1]:
## Load the dplyr package
suppressPackageStartupMessages(library(dplyr))

## Read in the data
fruits <- read.csv("datasets/fruits.csv", stringsAsFactors = FALSE) %>%
    select(-X)

## Extract the relevant rows
fruits_sub <- fruits[fruits$Name == "Watermelon" | fruits$Name == "Orange",]
fruits_sub

## At the specified serving size, which fruit has more calories?
fruits_sub$Name[which.max(fruits_sub$Calories)]

## What is the difference in calories?
diff(fruits_sub$Calories)

## How much Vitamin A does each fruit contain?
fruits_sub[,c("Name", "VitaminA_mg")]

| | Name | VitaminA_mg |
|---|---|---|
| 26 | Watermelon | 438 |
| 30 | Orange | 295 |
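As a small aside on the helper functions used above, here is a minimal sketch of how `diff()` and `which.max()` behave on a two-row table. The calorie values are hypothetical and chosen only for illustration; they are not taken from the fruit data.

```r
# Hypothetical calorie values, chosen only for illustration
toy <- data.frame(Name = c("Watermelon", "Orange"),
                  Calories = c(80, 50),
                  stringsAsFactors = FALSE)

# diff() subtracts each element from the one after it: second minus first
cal_diff <- diff(toy$Calories)                   # 50 - 80 = -30

# which.max() returns the position of the largest value
top_fruit <- toy$Name[which.max(toy$Calories)]   # "Watermelon"
```

Because `diff()` computes second minus first, the sign of the result depends on the row order of the table.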


## 2. Normalizing the numeric variables in the fruits data

Suppose I pose the following question: On which variable are watermelons and oranges **most different**? Our instinct might be to compute the differences for each numeric variable, and see which one has the biggest difference. But one issue with this approach is that the variables are not always on the same *scale*. For example, *Fiber* is in grams, but Vitamin A is in milligrams. And Calories is in an entirely different unit altogether. How can we determine the variable upon which they truly are the most different?

One way to handle this issue is **normalization**. By standardizing each variable (subtracting its mean and dividing by its standard deviation), we account for the different scales and put every variable in a consistent form. Let's standardize our numeric variables now.
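In symbols (a sketch; the notation $v_k$, $\bar{v}$, and $s_v$ is introduced here for illustration only), if a variable takes the values $v_1, \dots, v_m$ across the $m$ fruits, its standardized values are:

$z_k = \frac{v_k - \bar{v}}{s_v}$

where $\bar{v}$ is the sample mean and $s_v$ is the sample standard deviation of that variable. Each standardized variable therefore has mean 0 and standard deviation 1.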

In [2]:
## Standardize each numeric variable
fruits_standardized <- fruits %>%
    mutate_if(is.numeric, function(x) (x - mean(x)) / sd(x))
              
## Summarize the result
summary(fruits_standardized)
              
## Extract the relevant rows
fruits_standardsub <- fruits_standardized[fruits_standardized$Name == "Watermelon" | 
                                          fruits_standardized$Name == "Orange",]
              
## Get the standardized differences for each numeric variable
fruits_standardsub %>%
    summarise_if(is.numeric, diff)

| FruitWeight_g | Calories | Fiber_g | VitaminA_mg | VitaminC_mg | Potassium_mg | Folate_µg |
|---|---|---|---|---|---|---|
| 1.152371 | 0.9738931 | 0.726072 | -0.2182717 | 2.043704 | 1.08143 | 3.171294 |
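To see why standardizing makes differences comparable, here is a minimal sketch on a toy vector. The values and the helper name `standardize()` are hypothetical and not part of the fruit data. It shows that a raw difference changes with the unit of measurement, while the standardized difference does not.

```r
# Hypothetical toy measurements for five fruits (not from the fruit data)
vitamin_mg <- c(438, 295, 120, 60, 510)
vitamin_g  <- vitamin_mg / 1000   # same quantity, different unit

# Hypothetical helper: subtract the mean, divide by the standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

# The raw difference between the first two fruits depends on the unit
raw_diff_mg <- diff(vitamin_mg[1:2])   # -143
raw_diff_g  <- diff(vitamin_g[1:2])    # -0.143

# The standardized difference is the same in either unit
z_diff_mg <- diff(standardize(vitamin_mg)[1:2])
z_diff_g  <- diff(standardize(vitamin_g)[1:2])
same_z <- isTRUE(all.equal(z_diff_mg, z_diff_g))

# scale() from base R performs the same centering and scaling
same_as_scale <- isTRUE(all.equal(standardize(vitamin_mg),
                                  as.numeric(scale(vitamin_mg))))
```

Both `same_z` and `same_as_scale` should come out `TRUE`, which is why the standardized differences can be compared across variables measured in grams, milligrams, or calories.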


## 3. Computing distance between watermelons, oranges, and mangoes

With standardized data, we can begin thinking about **distances**. That is, how "far apart" are oranges and watermelons? To simplify things, we will answer this question using only the *numeric* variables for now.

The simplest way to measure the distance between two observations described by numeric variables is the **Euclidean distance**, which is defined as:

$d = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$

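Here $x_i$ and $y_i$ are the standardized values of the $i$-th numeric variable for the two fruits being compared, and $n$ is the number of numeric variables. We compute this distance for each pair among watermelons, oranges, and a new fruit, mangoes.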

In [3]:
## Get the relevant columns and rows
fruits_standardnum <- fruits_standardized %>%
    slice(match(c("Watermelon", "Orange", "Mango"), Name)) %>%
    select_if(is.numeric)
rownames(fruits_standardnum) <- c("Watermelon", "Orange", "Mango")

## Compute distance between watermelons and oranges
d1 <- sqrt(sum((fruits_standardnum[1,] - fruits_standardnum[2,])^2))

## Compute distance between watermelons and mangoes
d2 <- sqrt(sum((fruits_standardnum[1,] - fruits_standardnum[3,])^2))

## Compute distance between oranges and mangoes
d3 <- sqrt(sum((fruits_standardnum[2,] - fruits_standardnum[3,])^2))

## Which is most different?
c(d1, d2, d3) # d1 is the largest: watermelons and oranges

## An easier way to do the above calculation!
dist(fruits_standardnum)

       Watermelon   Orange
Orange   4.272537         
Mango    1.386879 3.119389
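To check that `dist()` matches the formula, here is a minimal sketch on three hypothetical points whose distances are easy to verify by hand. The points and the helper name `euclid()` are assumptions made for illustration, not part of the fruit data.

```r
# Hypothetical points, chosen so the answer is easy to check by hand
pts <- rbind(a = c(0, 0),
             b = c(3, 4),
             c = c(6, 8))

# Euclidean distance written out from the formula
euclid <- function(x, y) sqrt(sum((x - y)^2))
d_ab <- euclid(pts["a", ], pts["b", ])   # sqrt(3^2 + 4^2) = 5

# dist() prints only the lower triangle; as.matrix() expands it
# to a full symmetric matrix with zeros on the diagonal
d_mat <- as.matrix(dist(pts))
d_mat["b", "a"]                          # same value as d_ab
```

Converting the result with `as.matrix()` is handy when a specific pair needs to be looked up by name, as in `d_mat["b", "a"]`.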

*Stop here! Only the first three tasks. :)*