# Comparing Distributions in R

## 0\. Preparation

In [1]:
getwd()

In [20]:
data <- read.csv("data_compare_distr.csv")
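The `str()` output further down shows text columns as `Factor`, which is what `read.csv` produced by default before R 4.0.0; from 4.0.0 on, the default is `stringsAsFactors = FALSE`. A minimal sketch of making that choice explicit, using a small made-up CSV written to a temporary file (the file contents are illustrative assumptions, not the real data):

```r
# Toy CSV written to a temporary file; the columns are hypothetical stand-ins.
tmp <- tempfile(fileext = ".csv")
writeLines(c("site,types,score",
             "3,bn,12.2",
             "3,cu,8.9"), tmp)

# Explicit argument, so the result does not depend on the R version.
toy.fact <- read.csv(tmp, stringsAsFactors = TRUE)
toy.char <- read.csv(tmp, stringsAsFactors = FALSE)

class(toy.fact$types)  # factor
class(toy.char$types)  # character
unlink(tmp)
```

Passing the argument explicitly keeps the later class checks reproducible across R versions.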

In [30]:
library(jsonlite)
library(dplyr)
library(ggplot2)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [21]:
dim(data)

In [22]:
str(data)

'data.frame':	33357 obs. of  36 variables:
 $ site             : int  3 3 3 3 3 3 3 3 3 3 ...
 $ is.bot           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ time             : Factor w/ 20889 levels "2016-04-26 00:00:48",..: 20354 20349 20352 20384 20385 20369 20373 20375 20414 20404 ...
 $ total            : int  1077 1077 1077 1077 1077 1077 1077 1077 1077 1077 ...
 $ max.score        : num  12.2 12.2 12.2 12.2 12.2 ...
 $ score            : num  12.2 12.2 12.2 12.2 12.2 ...
 $ types            : Factor w/ 2 levels "bn","cu": 1 1 1 1 1 1 1 1 1 1 ...
 $ site.stat        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ campaign         : int  350 350 350 350 350 347 343 350 349 263 ...
 $ format           : Factor w/ 7 levels "bn1","bn2","bn20",..: 5 5 5 5 2 6 6 5 6 5 ...
 $ master           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ uid              : int  NA NA NA NA NA NA NA NA NA NA ...
 $ os               : Factor w/ 35 levels "Android","AndroidPhone",..: 11 10 11 10 8 11 10 11 6 16 ...
 $ browser          : Fact

In [23]:
data <- data[sort(colnames(data))]

Let's look at the variable classes:

In [31]:
classes <- lapply(data, class)
print(as.character(classes))
table(as.character(classes))

 [1] "factor"  "integer" "integer" "factor"  "factor"  "factor"  "integer"
 [8] "factor"  "factor"  "factor"  "factor"  "integer" "factor"  "factor" 
[15] "numeric" "integer" "factor"  "integer" "integer" "integer" "integer"
[22] "numeric" "numeric" "factor"  "integer" "integer" "numeric" "integer"
[29] "numeric" "integer" "integer" "integer" "factor"  "integer" "factor" 
[36] "integer"



 factor integer numeric 
     14      17       5 

Let's see which variables are `integer`:

In [32]:
(classes.int <- colnames(data.frame(subset(classes, classes == "integer"))))

Clearly, some of them are not really `integer` but `factor`. Let's fix that and store the names of the variables of each class in separate vectors:

In [33]:
print("Number of unique values in integer variables:")
lapply(data[classes.int], unique) %>%  lapply(length)

[1] "Number of unique values in integer variables:"
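The reasoning behind this check: an `integer` column with only a handful of distinct values is more likely a coded category than a count. A rough sketch on made-up data; the column values and the cutoff `max.levels` are hypothetical and would need to be chosen by eye on the real data:

```r
# Toy data: `campaign` is an integer-coded category, `total` is a real count.
toy <- data.frame(campaign = c(350L, 350L, 347L, 343L, 350L, 349L),
                  total    = c(1077L, 980L, 1500L, 21331L, 15306L, 42L))

# Number of distinct values per column, as a named integer vector.
n.unique <- sapply(toy, function(x) length(unique(x)))
print(n.unique)

# Hypothetical cutoff: columns with at most this many distinct values
# are flagged as candidates for conversion to factor.
max.levels <- 4
factor.candidates <- names(n.unique)[n.unique <= max.levels]
print(factor.candidates)
```

A cutoff like this only suggests candidates; identifiers such as `campaign` can have many levels and still be categorical, so the final list is best reviewed by hand.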


In [37]:
classes[c("adsystem", "agent", "banner", "campaign", "errors", "flash",
          "format", "rekl", "scheme", "rm.ip", "site.code", "site.stat",
          "stat.format", "worker", "pay.for")] <- "factor"

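# Subset the list of classes, not the data frame:
# subset(data, <logical>) would filter rows and keep every column.
classes.int <- colnames(data.frame(subset(classes, classes == "integer")))
classes.num <- colnames(data.frame(subset(classes, classes == "numeric")))
classes.fact <- colnames(data.frame(subset(classes, classes == "factor")))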
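One way to double-check the three name vectors is to group the column names by their (overridden) class in a single step. A minimal sketch on a toy data frame; `toy` and its columns are illustrative assumptions, not the real data:

```r
toy <- data.frame(campaign = c(350L, 347L),
                  total    = c(1077L, 980L),
                  score    = c(12.2, 8.9),
                  types    = factor(c("bn", "cu")))

# First class of each column, as a named character vector.
toy.classes <- vapply(toy, function(x) class(x)[1], character(1))

# Manual override, as in the cell above: campaign is really a category.
toy.classes["campaign"] <- "factor"

# Column names grouped by class.
by.class <- split(names(toy.classes), toy.classes)
by.class$factor   # "campaign" "types"
by.class$integer  # "total"
by.class$numeric  # "score"
```

Taking `class(x)[1]` guards against columns whose class vector has more than one element, such as ordered factors.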

To factors, what belongs to factors! Let's change the variable classes where needed, not forgetting `timestamp` and `X.timestamp`.

In [39]:
data[classes.fact] <- lapply(data[classes.fact], as.factor)

In [None]:
data$time
# data$timestamp <- as.character(data$timestamp)
# data$X.timestamp <- as.character(data$X.timestamp)

# final check:
# (lapply(data, class))
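The commented-out lines would turn both timestamp columns into character. If the times are needed for arithmetic, an alternative is to parse them into `POSIXct`. This is only a sketch, under two assumptions read off the sample rows in the `head()` output in section 1: that `timestamp` holds milliseconds since the Unix epoch and that `X.timestamp` is an ISO 8601 string in UTC.

```r
# Sample values copied from the first row of the head() output.
ts.ms  <- 1461719398070
ts.iso <- "2016-04-27T01:09:57.766Z"

# Milliseconds since the epoch -> seconds -> POSIXct in UTC.
t1 <- as.POSIXct(ts.ms / 1000, origin = "1970-01-01", tz = "UTC")

# ISO 8601 string; %OS accepts fractional seconds on input.
t2 <- as.POSIXct(ts.iso, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

format(t1, "%Y-%m-%d %H:%M:%S")
format(t2, "%Y-%m-%d %H:%M:%S")

# Gap between the two clocks, in seconds.
as.numeric(difftime(t1, t2, units = "secs"))
```

Applied to the real columns, the same calls would take `data$timestamp` and `as.character(data$X.timestamp)` in place of the two sample values.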

## 1\. Exploratory Analysis

In [12]:
head(data, n = 3)
tail(data, n = 3)

Unnamed: 0,X.timestamp,action,adsystem,agent,amuid.format,amuid.format.site,banner,browser,browser.lang,browser.version,ellip.h,score,site.code,site.stat,stat.format,stavka,timestamp,total,ttl,types,worker
1,2016-04-27T01:09:57.766Z,,2,901,,,62,MSIE,"[{""type"":""m""",,<8b>,8.901715,,6,2,50,1461719398070,21331,983745312,bn,1080
2,2016-04-27T01:08:15.852Z,,2,2633,,,132,Firefox,"[{""type"":""m""",,<8b>,8.901715,,6,2,12,1461719296216,21331,983643458,bn,50
3,2016-04-27T01:08:24.818Z,,2,2622,,,37,Firefox,ru,,<8b>,8.901715,,6,2,14,1461719304952,21331,983652194,bn,60


Unnamed: 0,X.timestamp,action,adsystem,agent,amuid.format,amuid.format.site,banner,browser,browser.lang,browser.version,ellip.h,score,site.code,site.stat,stat.format,stavka,timestamp,total,ttl,types,worker
9998,2016-04-26T10:51:24.857Z,,2,1206,,,164,undefined,ru,,<8b>,9.211555,,6,2,136000,1461667885035,15306,932210018,bn,240
9999,2016-04-26T10:51:24.665Z,,2,2808,,,158,Safari,ru,,<8b>,9.211555,,6,2,12,1461667885059,15306,932210042,bn,50
10000,2016-04-26T10:51:24.710Z,,2,2808,,,34,Safari,ru,,<8b>,9.211555,,6,1,60,1461667885059,15306,932210042,bn,50


In [15]:
# str(data)

In [14]:
summary(data)

 X.timestamp         action     adsystem        agent      amuid.format
 Length:10000       view: 218   1   : 218   1416   :  65   new : 209   
 Class :character   NA's:9782   2   :7094   816    :  59   old :  19   
 Mode  :character               3   :2678   1322   :  21   NA's:9772   
                                NA's:  10   435    :  19               
                                            983    :  12               
                                            (Other):9812               
                                            NA's   :  12               
 amuid.format.site     banner        browser           browser.lang 
 new : 227         34     :1434   Firefox:2554   [{          :  73  
 old :   1         36     : 917   Chrome :2553   [{"type":"m":1806  
 NA's:9772         37     : 853   MSIE   :1923   en          :   5  
                   2      : 640   Opera  :1015   ru          :5437  
                   1      : 433   Safari : 625   NA's        :2679  
          

In [20]:
qplot(cpm, data = data)

In is.na(data$y): is.na() applied to non-(list or vector) of type 'NULL'

ERROR: Error in seq.default(from = best$lmin, to = best$lmax, by = best$lstep): 'from' must be of length 1


ERROR: Error in file(con, "rb"): cannot open the connection

