Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The first three text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: Clustering from Start to Finish with Bustabit Online Gambling Data

**Name:** Eric Hare

**Email address associated with your DataCamp account:** ericrhare@gmail.com

**GitHub username:** erichare

**Project description**: Finding related groups of observations within a dataset is an extremely important part of Unsupervised Learning. In this project, we perform a full cluster analysis, beginning with the raw data. We proceed by deriving features used for the clustering, before performing a clustering and assessing the results with visualization techniques. Finally, we name and interpret the resulting clusters.

Basic knowledge of, and comfort with, R programming is a must. Knowledge of matrix operations is a plus. We will be using R along with several companion packages such as cluster, dplyr, tidyr, and ggplot2. Experience with data visualization in ggplot2 is useful as well.

We will be using data on users of **Bustabit**, an online gambling platform on which users bet an amount of cryptocurrency in hopes of winning more. The data are recorded at the per-game level and include each user's bet, the amount won or lost, and the date the game was played.

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. A preliminary look at the Bustabit data

The similarities and differences in the behaviors of different people have long been of interest, particularly in psychology and other social science fields. We are going to focus on the behavior of **online gamblers** from a platform called <a href="https://www.bustabit.com" target="_blank">Bustabit</a>. There are a few basic rules to playing a game of Bustabit:

![](img/bustabit.png)

1. You bet a certain amount of money (in Bitcoin), and you win if you cash out before the game **busts**.
2. Your winnings are your bet times the multiplier value at the moment you cash out. For example, if you bet 100 and cash out when the multiplier is 2.50x, you win 250.
3. The multiplier increases as time goes on, but if you wait too long to cash out, you may bust and lose your money.
4. Lastly, the house maintains a slight advantage because in 1 out of every 100 games, everyone playing busts.
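
As a quick worked instance of rule 2, using the figures from the example above. The net-gain line is plain arithmetic under the assumption that the 250 includes the original stake; it also ignores any bonus, which the rules above do not describe.

$\text{payout} = \text{bet} \times \text{multiplier} = 100 \times 2.50 = 250$

$\text{net gain} = \text{payout} - \text{bet} = 250 - 100 = 150$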

Let's begin by doing an exploratory dive into the Bustabit data...

In [8]:
## Load the dplyr package
library(dplyr)

## Read in the data
bustabit <- read.csv("datasets/bustabit_sub.csv", stringsAsFactors = FALSE)

## Look at the head of the data
head(bustabit)

## Who had the highest profit in a single game?
bustabit %>%
    arrange(desc(Profit)) %>%
    slice(1)

## What was the highest Multiplier value (BustedAt) ever achieved in a single game?
bustabit %>%
    arrange(desc(BustedAt)) %>%
    slice(1)

Id,GameID,Username,Bet,CashedOut,Bonus,Profit,BustedAt,PlayDate
19029273,3395044,Shadowshot,130,2,2.77,133.6,251025.1,2016-11-29T00:03:05Z


## 2. Deriving relevant features for clustering

The basic task at hand is to cluster the **players** of Bustabit, but we have data at the per-game level. Therefore, we must derive **features** that quantify each player's behavior before we can think about the relationships and similarities between groups of players. Some features we will create are:

1. The average multiplier at which the player cashes out
2. The standard deviation of the cash-out multiplier
3. The average bet
4. The standard deviation of the bets
5. The total losses over time for the player
6. The total winnings over time for the player
7. The number of individual games the player lost
8. The number of individual games the player won

With these variables, we can group similar users together based on their typical Bustabit gambling behavior.

In [9]:
## Derive per-player features for clustering
## A missing Profit (NA) is treated as a lost game
## Note: the SDCashedOut and AverageBet names are reconstructed from the feature list above
data_clus <- bustabit %>% 
  mutate(CashedOut = ifelse(is.na(CashedOut), BustedAt + .01, CashedOut),
         Losses = ifelse(is.na(Profit), Bet * -1, 0),
         Winnings = ifelse(is.na(Profit), 0, Profit),
         GameLost = ifelse(is.na(Profit), 1, 0),
         GameWon = ifelse(is.na(Profit), 0, 1)) %>%
  select(CashedOut, Profit, Bet, Username, Losses, Winnings, GameLost, GameWon) %>%
  group_by(Username) %>%
  summarise(AverageCashedOut = mean(CashedOut),
            SDCashedOut = sd(CashedOut),
            AverageBet = mean(Bet),
            SDBet = sd(Bet),
            TotalLosses = sum(Losses),
            TotalWinnings = sum(Winnings),
            GamesLost = sum(GameLost),
            GamesWon = sum(GameWon))




## 3. Computing distance between watermelons, oranges, and mangos

With standardized data, we are now able to begin thinking about **distances**. That is, how "far apart" are oranges and watermelons? To simplify things, we will answer this question only in terms of the *numeric* variables at this point in time.

The simplest way to measure the distance between observations described by numeric variables is the **Euclidean distance**, which is defined as:

$d = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$

Using the numeric variables, we compute this distance for each pair among watermelons, oranges, and a new fruit, mangos.
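
To make the formula concrete, here is a small worked example with two made-up observations (hypothetical values, not taken from the fruit data), $x = (1, 0, -1)$ and $y = (0, 2, 1)$:

$d = \sqrt{(1 - 0)^2 + (0 - 2)^2 + (-1 - 1)^2} = \sqrt{1 + 4 + 4} = \sqrt{9} = 3$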

In [51]:
## Get the relevant columns and rows
fruits_standardnum <- fruits_standardized %>%
    slice(match(c("Watermelon", "Orange", "Mango"), Name)) %>%
    select_if(is.numeric)
rownames(fruits_standardnum) <- c("Watermelon", "Orange", "Mango")

## Compute distance between watermelons and oranges
d1 <- sqrt(sum((fruits_standardnum[1,] - fruits_standardnum[2,])^2))

## Compute distance between watermelons and mangos
d2 <- sqrt(sum((fruits_standardnum[1,] - fruits_standardnum[3,])^2))

## Compute distance between oranges and mangos
d3 <- sqrt(sum((fruits_standardnum[2,] - fruits_standardnum[3,])^2))

## Which is most different?
c(d1, d2, d3) # Watermelons and oranges

## An easier way to do the above calculation!
dist(fruits_standardnum)

       Watermelon   Orange
Orange   4.272537         
Mango    1.386879 3.119389

*Stop here! Only the first three tasks. :)*