## RLab 7:
In this R Lab assignment, we will  use K-means clustering algorithm for dimension reduction and data labeling purposes. Please read the following two chapters for the details. 

- HOML CH 20 K-Means Clustering
- HOML CH 21 Hierarchical Clustering

# Soccer Data

We have the following publicly available soccer data from FiveThirtyEight website. FiveThirtyEight  has data on on player performances across 16 metrics in World Cup from 1966 to 2018. FiveThirtyEight's Modeled Event Soccer Similarity Index (MESSI) is a system that evaluates and compares player performances across 16 metrics. For moere information, please visit the following website: https://projects.fivethirtyeight.com/world-cup-comparisons/.



- The data is in panel format: players across years. Hence, the same player's statistics can appear in different rows for different Worls Cup tournament. 
- There are 5899 rows of data from 14 World Cup tournament on 4533 soccer players from 80 different countries.  

- Each metric is measured on a per-match basis, and for each metric, FiveThirtyEight calculates a z-score — the number of standard deviations above or below average for that World Cup. Any metric above 0 indicates that the player performed better than the average player, vice versa. 

Variables:


- **player**: name of the player
- **season**: World cup tournament year
- **team**  : Country       
- **goals_z**: Standardized goals
- **xg_z**: Standardized expected goals
- **crosses_z**: Standardized crosses
- **boxtouches_z**: Standardized touches in box
- **passes_z**: Standardized passes
- **progpasses_z**: Standardized progressive passes
- **takeons_z**: Standardized take-ons
- **progruns_z**: Standardized progressive dribbles
- **tackles_z** : Standardized tackles    
- **interceptions_z**: Standardized interceptions
- **clearances_z**: Standardized clearences
- **blocks_z**: Standardized blocks
- **aerials_z**: Standardized others
- **fouls_z**: Standardized fouls committed
- **fouled_z**: Standardized fouls drawn
- **nsxg_z**: Standardized non-shot expected goals 


# Soccer game 
In case you don't know, a soccer game is played by two teams. Each time is allowed no more than 11 players on the field at any one time and there is a goalkeeper to prevent the other team from scorin. As a last line of defence, only the goalkeeper is allowed to use their hands (Yes, it is a strange rule). The remaining players can only use their feet,  head or chest to play the ball. 

If you Google "Soccer positions", you will get the following important positons on the field. 

- Goalkeeper
- Central Defender
- Left and Right Full-back (or Wingback)
- Central Midfielder
- Left and Right Midfielder
- Forward
- Striker



In [76]:
#Call the packages needed 
library(tidyverse)
library(dplyr)
library(ggplot2)
library(class)
library(testthat)





player_data<-read.csv("soccer.csv")



# Best scorers in 2018

player_data%>%
     filter(season==2018)%>%
     select(player, goals_z)%>%
     arrange(desc(goals_z))%>%head()


Unnamed: 0_level_0,player,goals_z
Unnamed: 0_level_1,<fct>,<dbl>
1,Cristiano Ronaldo,6.46
2,Harry Kane,5.21
3,Denis Cheryshev,4.74
4,Yerry Mina,4.34
5,Diego Costa,4.34
6,Mile Jedinak,4.17


In [77]:

# Best scorers from USA
player_data%>%
     filter(team=="USA")%>%
     select(player, goals_z, season)%>%
     arrange(desc(goals_z))%>%head()

Unnamed: 0_level_0,player,goals_z,season
Unnamed: 0_level_1,<fct>,<dbl>,<int>
1,Landon Donovan,4.39,2010
2,Clint Dempsey,2.6,2014
3,Brian McBride,2.13,2002
4,Landon Donovan,2.13,2002
5,Clint Dempsey,2.08,2006
6,Bruce Murray,1.9,1990


# Exercise 1:
 
Before clustering our data, we will save all numerical columns in **player_data**  under a different name. 

- Create a new data frame and name it as **player_attributes** to store the 16  numerical columns from **player_data** dataset. 

In [78]:


# Exercise #1: Create player_attributes dataset

# your code here

 player_attributes <- player_data %>% select(goals_z, xg_z, crosses_z, boxtouches_z, passes_z, progpasses_z, takeons_z, progruns_z, tackles_z, interceptions_z, clearances_z, blocks_z, aerials_z, fouls_z,
                  fouled_z, nsxg_z,  )


In [79]:
# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the player_attributes dataset", {
    expect_equal( IQR(player_attributes$blocks_z),0.79)
        

    expect_equal(  ncol(player_attributes),16)

    expect_equal(  min(player_attributes[3333,]),-1.18)
})


print("Passed! This was easy!")

### END HIDDEN TEST

[1] "Passed! This was easy!"


# Exercise 2:  k-means Clustering


In k-means clustering, if we want to create 5 clusters, the data will be clustered by the k-means method by partitioning the points into 5 groups in such a way that the sum of squares from points to the assigned cluster centres is minimized. 

In this exercise, we will use **class** package in R to use **kmeans** function to cluster our data. Data in **player_attributes** are all numerical and scaled, so no need to pre process the data. 

By using **set.seed(4230)**, cluster **player_attributes** data by imposing 5 clusters. Use 50 number of  starts (nstart=50).  Name your cluster object as **five_cluster**. 

**nstart**  is used with **kmeans()** function to decide how many random sets should be chosen for initial cluster assignments. When we set nstart=50, the algorithm will run for 50 different initial cluster assignments, the best one will be used for the final clustering. 



In [80]:
# Exercise #2: Create five_cluster

# your code here
set.seed(4230)
five_cluster <- kmeans(player_attributes, 5, nstart=50)


In [81]:
# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the player_attributes dataset", {
    expect_equal(round(sqrt(round(five_cluster$totss))),307)
        

    expect_equal(round(mean(five_cluster$center), 3)  ,0.269)

   
})


print("Passed! Good work!")

### END HIDDEN TEST

[1] "Passed! Good work!"


# Exercise 3:  

Based on **five_cluster** results, how many players are assigned in cluster # 3?  Count the number of player assigned to cluster # 3 and name it as **group3**.

In [82]:
five_cluster$size[3]

In [83]:
# Exercise #3: count cluster =5 in five_cluster

# your code here
group3 <- five_cluster$size[3]

In [84]:
# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the player_attributes dataset", {
    expect_equal(round((group3*0.005)^1.34, 3),4.194)
        
  #
})


print("Passed! Good Job!")

### END HIDDEN TEST

[1] "Passed! Good Job!"


# Exercise 4:
 Add a new column to **player_data** to store the cluster assignment of each player based on **five_cluster** (five_cluster$cluster). Name your column in **player_data** as **labels_5**.

In [85]:
# Exercise #4: update player_data

# your code here
player_data$labels_5 <- five_cluster$cluster[5]

In [86]:
# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the player_attributes dataset", {
    expect_equal(ncol(player_data),20)
        
 expect_equal(ncol(player_data)*nrow(player_data),117980)
        
    
    
})


print("Passed! Good Job!")

### END HIDDEN TEST

[1] "Passed! Good Job!"


In [87]:
# You can run this code to see how five_cluster labelling of players aligns with the data. 
#ggplot(player_data, aes(x = passes_z, y = goals_z, color = factor(five_cluster$cluster))) +
 # geom_point()


#boxplot(player_data$passes_z ~ five_cluster$cluster, 
 #       xlab="passes_z", ylab="Cluster", horizontal=TRUE)



With **five_cluster**, we have created five labels in the data. If we want to know which cluster best captures the best goal scorers in the dataset, we can calculate the mean **goals_z** values for each label in **labels_5**  in **player_data**.  Then, one cane conclde that the class label with the highest mean  **goals_z** value  captures the best goal scorers in the dataset. 

# Exercice 5:

Calculate the mean **goals_z** values in the **player_data** for each label stored in **labels_5** columb and store your findings in a 5 by 2 data frame, the first column stores the **labels_5** and the second column stores the average **goals_z** values in the dataset.  Name your calculation as **mean_goals_z**. 

In [88]:
# Exercise #5: calculate mean_goals_z

# your code here

mean_goals_z <- aggregate(data=player_data, goals_z ~ five_cluster$cluster, mean)


mean_goals_z

five_cluster$cluster,goals_z
<int>,<dbl>
1,-0.2213264
2,-0.2922514
3,0.3077187
4,1.7277796
5,-0.2235495


In [89]:
# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the player_attributes dataset", {
    expect_equal(round(mean(mean_goals_z[,2]), 2),0.26)
        
 expect_equal(as.integer(mean_goals_z[3,1]%*% mean_goals_z[4,2]),5)
        
    
    
})


print("Passed! Good Job!")

### END HIDDEN TEST

[1] "Passed! Good Job!"


# Quality of the partition
Selection of different level of k will give us a different class label. In cases where we do not know the true level of k, we need to use the data to learn it.  But, unfortunately, there is no clear answer to find the optimal k value. But, we can at least measure the quality of the partition by calculating the total sum of squares (TSS) explained by the partition. If  BSS and TSS stand for Between Sum of Squares and Total Sum of Squares, respectively,  we can calculate the TSS with the following formula:

  $ quality=100*\frac{BSS}{TSS}$.   Any partititon with higher TSS chas a better quality. 

# Exercise 6:

Calculate the quality of partition in **five_cluster** and save it as **quality**.
 

In [90]:
# Exercise #6: calculate TSS

# your code here

quality <- (five_cluster$betweenss / five_cluster$totss ) * 100


In [91]:
# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the player_attributes dataset", {
    expect_equal(round(sqrt(exp(quality^0.5))),28)
    
})


print("Passed! Good Job!")

### END HIDDEN TEST

[1] "Passed! Good Job!"


The following picture was downloaded from https://projects.fivethirtyeight.com/world-cup-comparisons/lionel-messi-2018/ website. It shows the most similar soccer players in the World Cup tournaments to Lionel Messi in 2018 World Cup. 
Since K-means clustering does not produce any distance measure, we will first create an aggregated performance measure and then, your task will be to sort the data to see if our cluster labeling captures some of the players in this list.

![title](messi.png)

In [92]:
# The code below creates a new column in player_data by adding up all 16 numerical scores for each player
# the name of the aggregated player score is **TotalScore**
# Run this code first

temp1<-player_data%>%
select(-player       ,-season,-team,-labels_5  )%>%
mutate(TotalScore = rowSums(.))

player_data$TotalScore<-temp1$TotalScore


# Exercise 7

Create a table by folllowing the instructions listed below and name it as **MessiLikely**. 

- Take the **player_data** and filter the rows to keep observations when **labels_5** column takes the value of 3  (Lionel Messi was put in cluster 3 for the 2018 World Cup Tournament)
- Sort the data by  **TotalScore** in descending order
- Display only **player**, **TotalScore**, **Season** and **team** columns. 
- Display only the first 20 rows

In [93]:
player_data

player,season,team,goals_z,xg_z,crosses_z,boxtouches_z,passes_z,progpasses_z,takeons_z,⋯,tackles_z,interceptions_z,clearances_z,blocks_z,aerials_z,fouls_z,fouled_z,nsxg_z,labels_5,TotalScore
<fct>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
Cristian Pavón,2018,Argentina,-0.42,-0.55,0.08,-0.24,-0.61,-0.72,-0.05,⋯,0.30,-0.79,-0.80,-0.45,-0.23,-0.38,-0.91,-0.58,2,-6.20
Eduardo Salvio,2018,Argentina,-0.42,-0.50,-0.51,0.64,0.02,-0.46,-0.05,⋯,0.61,0.67,0.32,-0.45,0.21,-1.11,0.68,-0.03,2,0.10
Enzo Pérez,2018,Argentina,-0.42,0.11,-0.51,-0.69,0.27,-0.65,-0.36,⋯,0.35,-0.22,-0.80,-0.45,-0.89,0.72,0.04,-0.66,2,-4.50
Federico Fazio,2018,Argentina,-0.42,-0.59,-0.51,-0.69,-0.97,-0.91,-0.67,⋯,-0.82,-0.79,-0.64,-0.45,0.21,-1.11,-0.59,-0.77,2,-10.56
Franco Armani,2018,Argentina,-0.42,-0.59,-0.51,-0.80,-0.67,0.64,-0.67,⋯,-0.82,-0.79,-0.64,-0.45,-0.89,-1.11,-0.91,-0.77,2,-10.07
Gabriel Mercado,2018,Argentina,1.30,-0.30,0.08,-0.35,1.00,-0.20,-0.36,⋯,1.43,0.79,0.64,0.73,-0.01,1.81,0.68,-0.47,2,6.92
Gonzalo Higuaín,2018,Argentina,-0.42,0.25,-0.51,-0.02,-0.77,-1.10,-0.36,⋯,-0.82,-0.79,-0.64,-0.45,-0.67,-0.38,-0.59,-0.31,2,-8.25
Javier Mascherano,2018,Argentina,-0.42,-0.52,-0.51,-0.80,3.25,0.51,0.26,⋯,3.71,2.75,0.80,-0.45,-0.01,4.37,1.32,-0.31,2,15.58
Lionel Messi,2018,Argentina,1.30,3.08,0.66,1.30,1.17,0.32,6.76,⋯,2.15,-0.79,-0.80,-0.45,-0.89,0.35,4.49,4.55,2,25.32
Lucas Biglia,2018,Argentina,-0.42,-0.50,-0.51,-0.35,-0.58,-0.91,-0.67,⋯,-0.11,-0.79,-0.80,-0.45,-0.45,-1.11,-0.59,-0.69,2,-9.77


In [100]:
# Exercise #7: MessiLikely

# your code here




MessiLikely <- player_data %>%
        select(player, TotalScore, season, team ) %>%
        filter(player_data$labels_5==3) 
      
MessiLikely

player,TotalScore,season,team
<fct>,<dbl>,<int>,<fct>


In [95]:
sum(MessiLikely$TotalScore)

In [62]:
# Test your code in here
### BEGIN HIDDEN TEST

test_that("Check the player_attributes dataset", {
    expect_equal(dim(MessiLikely)[1]*dim(MessiLikely)[1],400)
    expect_equal(sum(MessiLikely$team=="Argentina"),4)
    expect_equal(sum(MessiLikely$season==1974),2)
    
    
    
})


print("Passed! Good Job!")

### END HIDDEN TEST

ERROR: Error: Test failed: 'Check the player_attributes dataset'
* <text>:5: dim(MessiLikely)[1] * dim(MessiLikely)[1] not equal to 400.
1/1 mismatches
[1] 0 - 400 == -400
* <text>:6: sum(MessiLikely$team == "Argentina") not equal to 4.
1/1 mismatches
[1] 0 - 4 == -4
* <text>:7: sum(MessiLikely$season == 1974) not equal to 2.
1/1 mismatches
[1] 0 - 2 == -2
