**Overview**

This kernel compares the distributions of selected features between the train and test datasets. We use the Kolmogorov-Smirnov test to check whether any feature column differs significantly in distribution between the two.
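As a quick illustration of what the test measures, the sketch below computes the two-sample KS statistic by hand as the largest gap between two empirical CDFs and compares it with base R's `ks.test`. The data are synthetic (two normal samples with different means) and are only an assumption for demonstration; they are not the competition data.

```r
# Illustrative only: synthetic samples, not the train/test files.
set.seed(42)
x <- rnorm(200)            # sample from N(0, 1)
y <- rnorm(200, mean = 1)  # sample from N(1, 1)

# Two-sample KS statistic by hand: largest absolute gap between the two ECDFs,
# evaluated at every pooled observation.
grid <- sort(c(x, y))
d.manual <- max(abs(ecdf(x)(grid) - ecdf(y)(grid)))

# Base R test; its statistic should match the manual value.
res <- ks.test(x, y)
print(c(manual = d.manual, ks.test = unname(res$statistic)))
print(res$p.value)
```

A small p-value here suggests the two samples come from different distributions; a large one only means the test found no evidence of a difference.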

In [None]:
rm(list = ls())          # start from a clean workspace
library(tidyverse)       # dplyr verbs and ggplot2
library(data.table)      # fread for fast CSV loading
library(Matching)        # ks.boot

In [None]:
# Read both files; string columns are converted to factors
pub.train <- fread('../input/train_V2.csv', header = TRUE, stringsAsFactors = TRUE)
pub.test <- fread('../input/test_V2.csv', header = TRUE, stringsAsFactors = TRUE)

**Data Cleansing**

We will only select numeric columns for this test. 

In [None]:
# Drop columns by position; train drops one more column (25) than test
train <- pub.train[,c(-1,-2,-3,-16,-25)]
test <- pub.test[,c(-1,-2,-3,-16)]
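
Selecting by position depends on the column order of both files. As a hedged alternative, the sketch below picks the numeric columns that both tables share by name. The toy tables and their column names (`id`, `feat_a`, `category`, `feat_b`, `target`) are hypothetical and stand in for the real data.

```r
# Hypothetical toy tables; column names are made up for illustration.
toy.train <- data.frame(id = c("a", "b", "c"),
                        feat_a = c(0, 2, 1),
                        category = c("x", "y", "x"),
                        feat_b = c(10.5, 200, 35),
                        target = c(0.1, 0.9, 0.5),
                        stringsAsFactors = TRUE)
toy.test <- data.frame(id = c("d", "e"),
                       feat_a = c(3, 0),
                       category = c("y", "y"),
                       feat_b = c(0, 120.2),
                       stringsAsFactors = TRUE)

# Numeric columns of train that also exist in test, in train's column order.
is.num <- vapply(toy.train, is.numeric, logical(1))
shared <- intersect(names(toy.train)[is.num], names(toy.test))

toy.train.num <- toy.train[, shared, drop = FALSE]
toy.test.num <- toy.test[, shared, drop = FALSE]
print(shared)
```

With `data.table` objects such as those returned by `fread`, the equivalent selection would be `train[, ..shared]`. Selecting by name keeps the same feature in the same column position in both tables, which the column-by-column comparison later relies on.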


We will also extract the column names for our use later.

In [None]:
# Get names
feat_names <- names(train)


To save on kernel runtime, we will draw a random sample of 100 rows from each of the train and test sets.

In [None]:
set.seed(1)
train.sample <- sample(1:nrow(train), 100, replace = F)
test.sample <- sample(1:nrow(test), 100, replace = F)

train.samp <- train[train.sample,]
test.samp <- test[test.sample,]



Next, we will convert the data frames to matrices for our analysis.

In [None]:
train.mat <- as.matrix(train.samp)
test.mat <- as.matrix(test.samp)
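
One thing worth checking at this step: `as.matrix` returns a numeric matrix only if every column is numeric. If a factor or character column slips through, the whole matrix becomes character. The toy data frames below are hypothetical and only illustrate that behaviour.

```r
# Hypothetical toy data to show how as.matrix picks the matrix type.
toy <- data.frame(a = c(1.5, 2.5), b = c(3L, 4L))
toy.mixed <- cbind(toy, grp = factor(c("x", "y")))

num.mat <- as.matrix(toy)        # all columns numeric -> numeric matrix
chr.mat <- as.matrix(toy.mixed)  # one factor column -> character matrix

print(is.numeric(num.mat))
print(is.character(chr.mat))
```

A guard such as `stopifnot(is.numeric(train.mat), is.numeric(test.mat))` after the conversion would catch this before the test is run.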


**Kolmogorov-Smirnov Test**

We will apply the Kolmogorov-Smirnov test to each selected feature in turn and combine the results into a data frame. For this test, we will use the `ks.boot` function in the `Matching` package.

In [None]:
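# One p-value per feature column
n.feat <- ncol(train.mat)
mat.pval <- matrix(NA_real_, nrow = n.feat)

for (i in seq_len(n.feat)) {
  # Bootstrap KS test comparing column i of train and test
  ks <- ks.boot(train.mat[, i], test.mat[, i])
  mat.pval[i] <- ks$ks.boot.pvalue
}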
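
The same per-feature loop can be sketched with `vapply`, which returns a named vector of p-values directly. To keep the example runnable without the `Matching` package, it swaps in base R's `ks.test` for `ks.boot`; `ks.boot` additionally bootstraps the p-value, which is aimed at data that are not entirely continuous. The toy matrices and feature names `f1` and `f2` are assumptions for illustration.

```r
# Illustrative toy matrices; f1 has the same distribution in both,
# f2 is shifted in the "test" matrix.
set.seed(1)
toy.train.mat <- cbind(f1 = rnorm(100), f2 = runif(100))
toy.test.mat <- cbind(f1 = rnorm(100), f2 = runif(100, min = 0.5, max = 1.5))

# One KS p-value per shared column, matched by column name.
# Note: base ks.test is used here instead of Matching::ks.boot.
pvals <- vapply(
  colnames(toy.train.mat),
  function(nm) ks.test(toy.train.mat[, nm], toy.test.mat[, nm])$p.value,
  numeric(1)
)

toy.pval <- data.frame(Features = names(pvals), KS_pval = unname(pvals))
print(toy.pval[order(toy.pval$KS_pval), ])
```

Matching columns by name rather than by index avoids comparing different features if the two matrices ever end up with different column orders.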


In [None]:
options(repr.plot.width=6, repr.plot.height=5)
mat.pval <- as.data.frame(mat.pval)

df.pval <- data.frame(feat_names,(mat.pval))
names(df.pval) <- c('Features','KS_pval')

df.pval %>% arrange(KS_pval)

df.pval %>% ggplot(aes(reorder(Features, -KS_pval), KS_pval)) + 
  geom_bar(stat='identity', aes(fill=KS_pval)) + 
  coord_flip() + scale_fill_gradient() + xlab('Features') + ylab('p-value') + 
  ggtitle('p-values of Kolmogorov-Smirnov Test for Train and Test Data')


Based on this sample, the test does not indicate a significant difference between the train and test distributions of the selected features. Note, however, that the p-values, and therefore the ranking of features by p-value, depend on the sample drawn and on its size. Feel free to adjust the sample size in this kernel and check how the p-values change.
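
To illustrate the dependence on sample size, the sketch below runs base R's `ks.test` on synthetic normal samples whose means differ by a small, fixed amount. The helper `p.at` and the shift of 0.1 are hypothetical choices for demonstration, not values from the kernel.

```r
# Illustrative only: synthetic data with a small true shift between samples.
set.seed(7)

# Hypothetical helper: KS p-value for two normal samples of size n
# whose means differ by `shift`.
p.at <- function(n, shift = 0.1) {
  ks.test(rnorm(n), rnorm(n, mean = shift))$p.value
}

p.small <- p.at(100)    # same sample size as used in this kernel
p.large <- p.at(20000)  # much larger sample, same underlying shift

print(c(n100 = p.small, n20000 = p.large))
```

With a small sample, a real but modest difference can easily go undetected, so a high p-value at 100 rows is weak evidence that the distributions match.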