In [None]:
library(tidyverse) # metapackage of all tidyverse packages
library(ggcorrplot)
library(grid) # package for arranging plots into a grid
library(gridExtra)
library(caret) # package for confusion matrix
library(boot) # package for K-fold CV and bootstrap

In [None]:
dataset = read_csv("../input/performance-prediction/summary.csv")
head(dataset,10)

In [None]:
summary(dataset)

In [None]:
str(dataset)

# EDA

Let's find the most representative variables through exploratory data analysis (EDA) first and then construct a statistical model.

### Data Wrangling

I will be removing FreeThrowPercent, 3PointPercent and FieldGoalPercent, since they are just derived from other columns of the dataset, as well as the Name column.

In [None]:
dataset$Target = factor(ifelse(dataset$Target==1,"Above5Years","Less5Years")) # factor the Target column
dataset = dataset %>% select(-FreeThrowPercent,-`3PointPercent`,-FieldGoalPercent,-Name)
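
To illustrate why the percentage columns carry no extra information, here is a minimal sketch on a toy data frame. The column names and values are made up for illustration and are not necessarily those of summary.csv; the point is only that a shooting percentage can be rebuilt from the made and attempted counts.

```r
# Toy data; names and values are illustrative assumptions only
toy = data.frame(
  FieldGoalsMade    = c(2.6, 4.0, 1.1),
  FieldGoalsAttempt = c(7.6, 8.0, 2.5)
)

# A shooting percentage is just made / attempted * 100
toy$FieldGoalPercent = round(100 * toy$FieldGoalsMade / toy$FieldGoalsAttempt, 1)
toy
```

Keeping such a column next to its two ingredients would mostly add redundant, strongly related predictors to the model.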

### Boxplot

Since these variables are on different scales, I will be dividing them into several subplots instead of applying a log transformation to the y-axis.

In [None]:
library(grid)
options(repr.plot.width = 20, repr.plot.height = 20)
plot_boxplot = function(columns=vector()){ #simple function to produce plots
    dataset %>% select(names(dataset[,columns]),Target) %>% 
    gather(key = k1,value = VariableValue,-"Target") %>%
    ggplot(aes(y = VariableValue,x = Target,fill = Target)) +
    stat_boxplot(aes(fill = Target)) + facet_grid(.~k1) +
    theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        strip.text.x = element_text(size = 10, colour = "black", angle = 0)) +
    scale_fill_manual(values = c('#e7298a','#66a61e'))
}

box1 = plot_boxplot(columns = c(11:17))
box2 = plot_boxplot(columns = c(2,3))
box3 = plot_boxplot(columns = c(4,5))
box4 = plot_boxplot(columns = c(6:10))

grid.arrange(arrangeGrob(box1,box2,box3,box4,ncol=2,nrow=2))

As common sense would suggest, older players apparently do perform better than younger ones on average.

### Correlogram

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)
cnr = cor(dataset%>% select(-Target))
p.values = cor_pmat(dataset%>% select(-Target)) # p-values matrix
ggcorrplot(cnr, hc.order = TRUE, type = "lower",
   outline.col = "black",
   ggtheme = ggplot2::theme_gray,
   colors = c("#6D9EC1", "white", "#E46726"),p.mat=p.values,lab = TRUE)

As you can see, some pairs of variables are not significantly correlated, since their p-values are too high; these are the cells marked with an X. With that in mind, I'm going to remove 3PointMade and 3PointAttempt, also because they are not highly correlated with any other variables.
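
For intuition on where those p-values come from, here is a small sketch with simulated data (the variables `x`, `y` and `z` are hypothetical, not columns of the dataset). As far as I know, `cor_pmat` runs a correlation test like `cor.test` on every pair of columns, and `ggcorrplot` crosses out the pairs whose p-value is above its significance level (0.05 by default).

```r
set.seed(1)
x = rnorm(100)
y = x + rnorm(100, sd = 0.5)  # built to be strongly related to x
z = rnorm(100)                # independent noise

p_xy = cor.test(x, y)$p.value # tiny p-value: significant correlation
p_xz = cor.test(x, z)$p.value # usually large: no evidence of correlation
c(p_xy = p_xy, p_xz = p_xz)
```

A pair like `x` and `z` is the kind that would typically get an X in the correlogram.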

In [None]:
dataset = dataset %>% select(-`3PointMade`,-`3PointAttempt`)

# Model

## Basic Logit Model

In [None]:
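set.seed(42) # arbitrary fixed seed so the split is reproducible
shuffel.rows = sample(nrow(dataset), size = floor(nrow(dataset)*0.9)) # random 90% of the rows for training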
dataset_train = dataset[shuffel.rows,]
dataset_test = dataset[-shuffel.rows,]

logit.fit = glm(Target~., data=dataset_train,family='binomial')
summary(logit.fit)

In [None]:
logit.probs = predict(logit.fit,newdata = dataset_test,type = 'response')
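# glm treats the first factor level ("Above5Years") as failure, so logit.probs is P(Less5Years)
class.pred = factor(ifelse(logit.probs>0.5,"Less5Years","Above5Years"), levels = levels(dataset_test$Target))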
confusionMatrix(class.pred,dataset_test$Target)
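
One thing worth double-checking with a factor response is which class the predicted probability refers to. With `family = 'binomial'`, `glm` treats the first factor level as failure and models the probability of the other level. The sketch below uses simulated data (the predictor `x` is hypothetical) with the same two labels to show this.

```r
set.seed(42)
# Low x values belong to "Less5Years", high x values to "Above5Years"
x = c(rnorm(50, mean = -1), rnorm(50, mean = 1))
y = factor(rep(c("Less5Years", "Above5Years"), each = 50))
levels(y)  # alphabetical order: "Above5Years" first, "Less5Years" second

fit = glm(y ~ x, family = "binomial")
p = predict(fit, type = "response")

# p is the probability of the SECOND level, i.e. "Less5Years"
mean_less  = mean(p[y == "Less5Years"])   # should be above 0.5
mean_above = mean(p[y == "Above5Years"])  # should be below 0.5
c(mean_less = mean_less, mean_above = mean_above)
```

So when thresholding at 0.5, a probability above the threshold should map to the second level of the factor.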

So the model is not that good and could use some improvement.

## K-fold Cross-Validation

In [None]:
train.control = trainControl(method = "cv", number = 15)
logit.fitKfold = train(Target ~., data = dataset_train, method = "glm",
                       trControl = train.control)
logit.fitKfold

In [None]:
probs.Kfold = predict(logit.fitKfold, newdata = dataset_test, type = "raw") # type = "raw" returns predicted classes, not probabilities
confusionMatrix(probs.Kfold,dataset_test$Target)
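
As an alternative view of the same idea, here is a minimal sketch of K-fold cross-validation with `cv.glm` from the `boot` package loaded at the top. It runs on simulated data (the data frame `toy` and its columns are hypothetical), and the cost function is the usual misclassification rate at a 0.5 threshold.

```r
library(boot)

set.seed(123)
n = 200
toy = data.frame(x = rnorm(n))
toy$y = rbinom(n, size = 1, prob = plogis(1.5 * toy$x))  # 0/1 response

fit = glm(y ~ x, data = toy, family = binomial)

# r = observed 0/1 response, pi = predicted probability
cost = function(r, pi = 0) mean(abs(r - pi) > 0.5)

cv = cv.glm(toy, fit, cost = cost, K = 10)
cv$delta[1]  # cross-validated misclassification rate
coef(fit)    # the fitted coefficients are not changed by cross-validation
```

The cross-validated error is an estimate of how the model behaves on unseen data; it does not by itself produce a different model.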

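It appears that the results have improved on the test set. Keep in mind that `train` refits the same logistic regression on the full training set, so cross-validation mainly gives a more reliable estimate of accuracy rather than a different model.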
If there are any errors please do let me know.