# Comparing Distributions in R

## 0\. Preparation

In [1]:
getwd()

In [2]:
data <- read.csv("data_compare_distr.csv")

In [18]:
library(jsonlite)
library(dplyr)
library(ggplot2)

In [4]:
dim(data)

In [5]:
str(data)

'data.frame':	10000 obs. of  36 variables:
 $ X.timestamp      : Factor w/ 9800 levels "2016-04-26T09:01:54.584Z",..: 5307 5300 5301 5316 5321 5288 5294 5324 5328 5342 ...
 $ action           : Factor w/ 1 level "view": NA NA NA NA NA NA NA NA NA NA ...
 $ adsystem         : int  2 2 2 2 2 3 3 2 3 2 ...
 $ agent            : int  901 2633 2622 2339 2339 884 110 2660 61 1859 ...
 $ amuid.format     : Factor w/ 2 levels "new","old": NA NA NA NA NA NA NA NA NA NA ...
 $ amuid.format.site: Factor w/ 2 levels "new","old": NA NA NA NA NA NA NA NA NA NA ...
 $ banner           : int  62 132 37 121 123 NA NA 36 NA 34 ...
 $ browser          : Factor w/ 26 levels "360SE","Chrome",..: 5 3 3 3 3 2 4 3 4 2 ...
 $ browser.lang     : Factor w/ 4 levels "[{","[{\"type\":\"m\"",..: 2 2 4 2 2 NA NA 2 NA 2 ...
 $ browser.version  : Factor w/ 541 levels "0","1.1.1.0(29.0.1547.62)",..: NA NA NA NA NA 60 527 NA 527 NA ...
 $ campaign         : int  54 87 36 85 85 NA NA 36 NA 35 ...
 $ city             : Fa

In [6]:
data <- data[sort(colnames(data))]

Let's look at the classes of the variables:

In [7]:
classes <- lapply(data, class)
print(as.character(classes))
table(as.character(classes))

 [1] "factor"  "factor"  "integer" "integer" "factor"  "factor"  "integer"
 [8] "factor"  "factor"  "factor"  "integer" "factor"  "factor"  "numeric"
[15] "integer" "integer" "integer" "factor"  "numeric" "numeric" "factor" 
[22] "integer" "integer" "numeric" "integer" "integer" "numeric" "integer"
[29] "integer" "integer" "integer" "numeric" "integer" "integer" "factor" 
[36] "integer"



 factor integer numeric 
     12      18       6 
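The class tally above can be tried out on a small example. This is a minimal sketch on a toy data frame; the `toy` columns are hypothetical stand-ins, not columns from `data_compare_distr.csv`. It uses `sapply` with `class(col)[1]` so that a column carrying several classes (for example `POSIXct`) still contributes a single label.

```r
# Toy data frame standing in for `data` (hypothetical columns and values)
toy <- data.frame(
  id    = 1:4,                              # integer
  price = c(1.5, 2.0, 3.25, 4.0),           # numeric
  label = factor(c("a", "b", "a", "b"))     # factor
)

# One class label per column; [1] keeps only the first class
# for columns that have more than one (e.g. POSIXct)
toy.classes <- sapply(toy, function(col) class(col)[1])

print(toy.classes)   # named character vector: column name -> class
table(toy.classes)   # how many columns of each class
```

The result is a named character vector, so the column names stay attached to their classes and can be filtered directly.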

Let's see which variables are `integer`:

In [8]:
(classes.int <- names(classes)[classes == "integer"])

Clearly, some of them are not really `integer` variables but `factor`s. Let's fix that and store the variable names of each class in the corresponding vectors:

In [9]:
print("Number of unique values in integer variables:")
lapply(data[classes.int], unique) %>% lapply(length)

[1] "Number of unique values in integer variables:"
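Counting distinct values per column is one way to spot integer columns that behave like categories. The sketch below illustrates that idea on toy data; the `max.levels` threshold and the toy values are assumptions made for illustration. The notebook itself picks the factor columns by hand, and identifier columns such as `agent` can have many distinct values, so a threshold like this would not catch them.

```r
# Toy data (hypothetical values): one code-like column, one count-like column
toy <- data.frame(
  adsystem = c(2L, 2L, 3L, 2L, 3L, 2L),
  total    = c(21331L, 15306L, 9120L, 4407L, 18250L, 733L)
)

# Number of distinct values in each column
n.unique <- sapply(toy, function(col) length(unique(col)))

# Hypothetical cut-off: columns with at most this many distinct values
# are flagged as candidates for conversion to factor
max.levels <- 3
factor.candidates <- names(n.unique)[n.unique <= max.levels]

print(n.unique)
print(factor.candidates)
```

The flagged names are only candidates; whether a column is truly categorical still depends on what it means.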


In [10]:
classes[c("adsystem", "agent", "banner", "campaign", "errors", "flash",
          "format", "rekl", "scheme", "rm.ip", "site.code", "site.stat",
          "stat.format", "worker", "pay.for")] <- "factor"

classes.int <- names(classes)[classes == "integer"]
classes.num <- names(classes)[classes == "numeric"]
classes.fact <- names(classes)[classes == "factor"]

Render unto factors what is factors'! Let's change the variable classes where needed, not forgetting `timestamp` and `X.timestamp`.

In [11]:
data[classes.fact] <- lapply(data[classes.fact], as.factor)
data$timestamp <- as.character(data$timestamp)
data$X.timestamp <- as.character(data$X.timestamp)

# final check:
# (lapply(data, class))
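The conversion step can be checked on a small example. This is a sketch on a toy data frame with hypothetical values, using the same `lapply(..., as.factor)` pattern as the cell above. The final parsing step is not something the notebook does (it keeps timestamps as character). It is included only as an assumption about how ISO 8601 strings such as those in `X.timestamp` could be turned into date-times if needed.

```r
# Toy data (hypothetical values); the timestamp is stored as a factor,
# as read.csv does when strings are read as factors
toy <- data.frame(
  adsystem    = c(2L, 2L, 3L),
  X.timestamp = factor(c("2016-04-27T01:09:57.766Z",
                         "2016-04-27T01:08:15.852Z",
                         "2016-04-27T01:08:24.818Z")),
  total       = c(21331L, 21331L, 21331L)
)

# Convert the chosen columns to factor in one assignment
fact.cols <- c("adsystem")
toy[fact.cols] <- lapply(toy[fact.cols], as.factor)

# Timestamps are not categories: store them as plain strings
toy$X.timestamp <- as.character(toy$X.timestamp)

toy.classes <- sapply(toy, class)
print(toy.classes)

# Optional, not done in the notebook: parse the ISO 8601 strings as UTC
# date-times; %OS reads seconds including the fractional part
parsed <- as.POSIXct(toy$X.timestamp,
                     format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
print(parsed)
```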

## 1\. Exploratory Analysis

In [12]:
head(data, n = 3)
tail(data, n = 3)

| | X.timestamp | action | adsystem | agent | amuid.format | amuid.format.site | banner | browser | browser.lang | browser.version | ⋯ | score | site.code | site.stat | stat.format | stavka | timestamp | total | ttl | types | worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2016-04-27T01:09:57.766Z | | 2 | 901 | | | 62 | MSIE | `[{"type":"m"` | | ⋯ | 8.901715 | | 6 | 2 | 50 | 1461719398070 | 21331 | 983745312 | bn | 1080 |
| 2 | 2016-04-27T01:08:15.852Z | | 2 | 2633 | | | 132 | Firefox | `[{"type":"m"` | | ⋯ | 8.901715 | | 6 | 2 | 12 | 1461719296216 | 21331 | 983643458 | bn | 50 |
| 3 | 2016-04-27T01:08:24.818Z | | 2 | 2622 | | | 37 | Firefox | ru | | ⋯ | 8.901715 | | 6 | 2 | 14 | 1461719304952 | 21331 | 983652194 | bn | 60 |


| | X.timestamp | action | adsystem | agent | amuid.format | amuid.format.site | banner | browser | browser.lang | browser.version | ⋯ | score | site.code | site.stat | stat.format | stavka | timestamp | total | ttl | types | worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9998 | 2016-04-26T10:51:24.857Z | | 2 | 1206 | | | 164 | undefined | ru | | ⋯ | 9.211555 | | 6 | 2 | 136000 | 1461667885035 | 15306 | 932210018 | bn | 240 |
| 9999 | 2016-04-26T10:51:24.665Z | | 2 | 2808 | | | 158 | Safari | ru | | ⋯ | 9.211555 | | 6 | 2 | 12 | 1461667885059 | 15306 | 932210042 | bn | 50 |
| 10000 | 2016-04-26T10:51:24.710Z | | 2 | 2808 | | | 34 | Safari | ru | | ⋯ | 9.211555 | | 6 | 1 | 60 | 1461667885059 | 15306 | 932210042 | bn | 50 |


In [15]:
# str(data)

In [14]:
summary(data)

 X.timestamp         action     adsystem        agent      amuid.format
 Length:10000       view: 218   1   : 218   1416   :  65   new : 209   
 Class :character   NA's:9782   2   :7094   816    :  59   old :  19   
 Mode  :character               3   :2678   1322   :  21   NA's:9772   
                                NA's:  10   435    :  19               
                                            983    :  12               
                                            (Other):9812               
                                            NA's   :  12               
 amuid.format.site     banner        browser           browser.lang 
 new : 227         34     :1434   Firefox:2554   [{          :  73  
 old :   1         36     : 917   Chrome :2553   [{"type":"m":1806  
 NA's:9772         37     : 853   MSIE   :1923   en          :   5  
                   2      : 640   Opera  :1015   ru          :5437  
                   1      : 433   Safari : 625   NA's        :2679  
          

In [20]:
qplot(cpm, data = data)

In is.na(data$y): is.na() applied to non-(list or vector) of type 'NULL'

ERROR: Error in seq.default(from = best$lmin, to = best$lmax, by = best$lstep): 'from' must be of length 1


ERROR: Error in file(con, "rb"): cannot open the connection


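The notebook does not diagnose the errors above. As a sketch of what could be checked before plotting, the hypothetical helper `check_plottable()` below tests whether a column exists, is numeric, and has at least one finite value. The toy `cpm` values are made up, and the histogram is drawn only if `ggplot2` is installed.

```r
# Hypothetical helper (not part of the notebook): basic sanity checks
# on a column before passing it to a histogram
check_plottable <- function(df, col) {
  if (!col %in% names(df)) return("column not found")
  x <- df[[col]]
  if (!is.numeric(x)) return(paste("not numeric:", class(x)[1]))
  if (!any(is.finite(x))) return("no finite values")
  "ok"
}

# Toy data with made-up values, including a missing one
toy <- data.frame(
  cpm     = c(0.5, 1.2, NA, 3.4, 2.2),
  browser = factor(c("Firefox", "Chrome", "MSIE", "Opera", "Safari"))
)

print(check_plottable(toy, "cpm"))
print(check_plottable(toy, "browser"))
print(check_plottable(toy, "missing"))

# Draw the histogram only when the check passes and ggplot2 is available
if (check_plottable(toy, "cpm") == "ok" &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = cpm)) +
    ggplot2::geom_histogram(bins = 30, na.rm = TRUE)  # na.rm drops NA silently
  print(p)
}
```

Running the same checks on `data` with `"cpm"` would show whether the column is present and numeric in the loaded file, which the truncated `str()` output above does not settle.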