# Comparing Distributions in R

## 0\. Preparation

In [32]:
getwd()

In [33]:
data <- read.csv("data_compare_distr.csv")

In [34]:
library(jsonlite)
library(dplyr)

In [35]:
dim(data)

In [36]:
str(data)

'data.frame':	10000 obs. of  38 variables:
 $ total            : int  21331 21331 21331 21331 21331 21331 21331 21331 21331 21331 ...
 $ max.score        : num  8.9 8.9 8.9 8.9 8.9 ...
 $ index            : Factor w/ 2 levels "ad_views-2016.04.26",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ score            : num  8.9 8.9 8.9 8.9 8.9 ...
 $ ttl              : int  983745312 983643458 983652194 983839221 983849469 983238797 983424030 983967022 984002223 984438976 ...
 $ timestamp        : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
 $ types            : Factor w/ 2 levels "bn","cu": 1 1 1 1 1 1 1 1 1 1 ...
 $ site.stat        : int  6 6 6 6 6 6 6 6 6 6 ...
 $ rekl             : int  45 8 26 7 7 40 43 26 40 26 ...
 $ scheme           : int  226 87 7286 14658 14658 NA NA 7286 NA 7286 ...
 $ campaign         : int  54 87 36 85 85 NA NA 36 NA 35 ...
 $ banner           : int  62 132 37 121 123 NA NA 36 NA 34 ...
 $ stavka           : int  50 12 14 150000 150000 NA NA 24 NA 60 ...
 $ pay.for   

In [37]:
data <- data[sort(colnames(data))]

Let's look at the variable classes:

In [38]:
classes <- lapply(data, class)
print(as.character(classes))
table(as.character(classes))

 [1] "factor"  "integer" "factor"  "integer" "integer" "factor"  "factor" 
 [8] "integer" "factor"  "factor"  "factor"  "integer" "factor"  "factor" 
[15] "numeric" "integer" "integer" "integer" "factor"  "factor"  "numeric"
[22] "numeric" "factor"  "integer" "integer" "numeric" "integer" "integer"
[29] "numeric" "integer" "integer" "integer" "integer" "numeric" "integer"
[36] "integer" "factor"  "integer"



 factor integer numeric 
     13      19       6 

Let's see which variables are `integer`:

In [39]:
(classes.int <- names(classes)[classes == "integer"])

Clearly, some of them are not really `integer` but `factor`. Let's fix this and store the names of the variables of each class in the corresponding vectors:

In [47]:
print("Number of unique values in integer variables:")
lapply(data[classes.int], unique) %>% lapply(length)

[1] "Number of unique values in integer variables:"
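
The count of distinct values is the hint used here to spot integer columns that are really categorical. A minimal sketch of that check on a toy data frame follows; the `toy` data, its values, and the "at most half of the rows" threshold are assumptions made for illustration and are not taken from the notebook.

```r
# Toy data frame standing in for `data`; the values are made up.
toy <- data.frame(
  adsystem = c(1L, 2L, 2L, 3L, 2L, 1L),        # few distinct values: looks categorical
  stavka   = c(50L, 12L, 14L, 24L, 60L, 136L), # many distinct values: looks like a quantity
  x        = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)   # numeric, ignored below
)

# Names of the integer columns
int.cols <- names(toy)[sapply(toy, is.integer)]

# Number of distinct values per integer column, as a named vector
n.unique <- sapply(toy[int.cols], function(x) length(unique(x)))
print(n.unique)

# Hypothetical rule of thumb: flag a column as a factor candidate
# when at most half of its values are distinct
candidates <- names(n.unique)[n.unique <= nrow(toy) / 2]
print(candidates)
```

Cardinality is only a hint: identifier columns can have thousands of distinct values and still be categorical, so the final list is best chosen by hand, as in the next cell.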


In [48]:
classes[c("adsystem", "agent", "banner", "campaign", "errors", "flash",
          "format", "rekl", "scheme", "rm.ip", "site.code", "site.stat",
          "stat.format", "worker", "pay.for")] <- "factor"

classes.int <- names(classes)[classes == "integer"]
classes.num <- names(classes)[classes == "numeric"]
classes.fact <- names(classes)[classes == "factor"]
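
The same bookkeeping (one vector of column names per class) can also be done in a single step with `split()`. The sketch below runs on a made-up data frame; `toy`, `col.class`, and `by.class` are illustrative names, not objects from this notebook.

```r
# Toy data frame; stringsAsFactors = TRUE mimics the factor columns seen above.
toy <- data.frame(
  browser = c("Chrome", "Firefox", "Chrome"),
  rekl    = c(45L, 8L, 26L),
  score   = c(8.9, 8.9, 9.2),
  stringsAsFactors = TRUE
)

# First class of each column (class() can return more than one value)
col.class <- sapply(toy, function(x) class(x)[1])

# Manual override, in the spirit of the cell above: treat `rekl` as an ID
col.class["rekl"] <- "factor"

# One character vector of column names per class
by.class <- split(names(col.class), col.class)
print(by.class)
```

A named list such as `by.class` keeps the groups together, so `by.class$factor` plays the role of `classes.fact`.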

To factors what belongs to factors! Let's change the variable classes where needed, not forgetting `timestamp` and `X.timestamp`.

In [50]:
data[classes.fact] <- lapply(data[classes.fact], as.factor)
data$timestamp <- as.character(data$timestamp)
data$X.timestamp <- as.character(data$X.timestamp)

# final check:
# (lapply(data, class))
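
The conversion and the final check can be tried on a small example first. This is a sketch on made-up data; `toy` and `fact.cols` are hypothetical stand-ins for `data` and `classes.fact`.

```r
# Toy data frame with an integer ID column that contains a missing value
toy <- data.frame(
  adsystem = c(2L, 2L, 3L, NA),
  stavka   = c(50L, 12L, 14L, 24L)
)
fact.cols <- c("adsystem")  # hypothetical list of columns to convert

# Convert only the selected columns; the others keep their class
toy[fact.cols] <- lapply(toy[fact.cols], as.factor)

# Check: class of every column after the conversion
print(sapply(toy, class))

# Missing values stay NA and do not become a factor level
print(levels(toy$adsystem))
print(sum(is.na(toy$adsystem)))
```

`sapply(toy, class)` returns a compact named vector as long as every column has a single class, which makes the check easier to read than the list printed by `lapply`.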

## 1\. Exploratory Analysis

In [59]:
head(data, n = 3)
tail(data, n = 3)

| | X.timestamp | X.version | action | adsystem | agent | amuid.format | amuid.format.site | banner | browser | browser.lang | ⋯ | score | site.code | site.stat | stat.format | stavka | timestamp | total | ttl | types | worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2016-04-27T01:09:57.766Z | 1 | | 2 | 901 | | | 62 | MSIE | `[{"type":"m"` | ⋯ | 8.901715 | | 6 | 2 | 50 | 1461719398070 | 21331 | 983745312 | bn | 1080 |
| 2 | 2016-04-27T01:08:15.852Z | 1 | | 2 | 2633 | | | 132 | Firefox | `[{"type":"m"` | ⋯ | 8.901715 | | 6 | 2 | 12 | 1461719296216 | 21331 | 983643458 | bn | 50 |
| 3 | 2016-04-27T01:08:24.818Z | 1 | | 2 | 2622 | | | 37 | Firefox | ru | ⋯ | 8.901715 | | 6 | 2 | 14 | 1461719304952 | 21331 | 983652194 | bn | 60 |


| | X.timestamp | X.version | action | adsystem | agent | amuid.format | amuid.format.site | banner | browser | browser.lang | ⋯ | score | site.code | site.stat | stat.format | stavka | timestamp | total | ttl | types | worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2016-04-26T10:51:24.857Z | 1 | | 2 | 1206 | | | 164 | undefined | ru | ⋯ | 9.211555 | | 6 | 2 | 136000 | 1461667885035 | 15306 | 932210018 | bn | 240 |
| 2 | 2016-04-26T10:51:24.665Z | 1 | | 2 | 2808 | | | 158 | Safari | ru | ⋯ | 9.211555 | | 6 | 2 | 12 | 1461667885059 | 15306 | 932210042 | bn | 50 |
| 3 | 2016-04-26T10:51:24.710Z | 1 | | 2 | 2808 | | | 34 | Safari | ru | ⋯ | 9.211555 | | 6 | 1 | 60 | 1461667885059 | 15306 | 932210042 | bn | 50 |
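
Both timestamp columns are stored as character strings at this point. If date-time arithmetic is needed later, they could be parsed roughly as sketched below. This assumes that `X.timestamp` is ISO 8601 in UTC and that `timestamp` is milliseconds since the Unix epoch; neither is stated in the notebook, and the names `x.ts`, `ts`, `t1`, `t2` are illustrative.

```r
# Values copied from the first row of head(data) shown above
x.ts <- "2016-04-27T01:09:57.766Z"  # X.timestamp: assumed ISO 8601, UTC
ts   <- "1461719398070"             # timestamp: assumed milliseconds since the epoch

# Parse the ISO 8601 string; %OS accepts fractional seconds on input
t1 <- as.POSIXct(x.ts, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# Convert milliseconds to seconds before building a date-time
t2 <- as.POSIXct(as.numeric(ts) / 1000, origin = "1970-01-01", tz = "UTC")

print(t1)
print(t2)

# Gap between the two timestamps of the same record, in seconds
print(difftime(t2, t1, units = "secs"))
```

Under these assumptions the two values for this record land within a second of each other, which suggests they describe the same event recorded at two points in the pipeline.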


In [58]:
str(data)

Classes 'tbl_df', 'tbl' and 'data.frame':	10000 obs. of  38 variables:
 $ X.timestamp      : chr  "2016-04-27T01:09:57.766Z" "2016-04-27T01:08:15.852Z" "2016-04-27T01:08:24.818Z" "2016-04-27T01:11:31.788Z" ...
 $ X.version        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ action           : Factor w/ 1 level "view": NA NA NA NA NA NA NA NA NA NA ...
 $ adsystem         : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 3 3 2 3 2 ...
 $ agent            : Factor w/ 4074 levels "1","2","3","4",..: 901 2633 2622 2339 2339 884 110 2660 61 1859 ...
 $ amuid.format     : Factor w/ 2 levels "new","old": NA NA NA NA NA NA NA NA NA NA ...
 $ amuid.format.site: Factor w/ 2 levels "new","old": NA NA NA NA NA NA NA NA NA NA ...
 $ banner           : Factor w/ 219 levels "1","2","3","4",..: 62 132 37 121 123 NA NA 36 NA 34 ...
 $ browser          : Factor w/ 26 levels "360SE","Chrome",..: 5 3 3 3 3 2 4 3 4 2 ...
 $ browser.lang     : Factor w/ 4 levels "[{","[{\"type\":\"m\"",..: 2 2 4 2 2 NA NA 2 NA 2 ...
 $ brow

In [55]:
summary(data)

 X.timestamp          X.version  action     adsystem        agent     
 Length:10000       Min.   :1   view: 218   1   : 218   1416   :  65  
 Class :character   1st Qu.:1   NA's:9782   2   :7094   816    :  59  
 Mode  :character   Median :1               3   :2678   1322   :  21  
                    Mean   :1               NA's:  10   435    :  19  
                    3rd Qu.:1                           983    :  12  
                    Max.   :1                           (Other):9812  
                                                        NA's   :  12  
 amuid.format amuid.format.site     banner        browser    
 new : 209    new : 227         34     :1434   Firefox:2554  
 old :  19    old :   1         36     : 917   Chrome :2553  
 NA's:9772    NA's:9772         37     : 853   MSIE   :1923  
                                2      : 640   Opera  :1015  
                                1      : 433   Safari : 625  
                                (Other):3035   IE     : 586 