***Factors in R:***

- Used for nominal (categorical) data
- A hybrid of an integer vector (the codes) and a character vector (the levels)


In [78]:
set.seed(1337)
f <- factor(sample(LETTERS, 30, replace = TRUE))

In [79]:
str(f)
summary(f)
f

 Factor w/ 17 levels "C","D","E","G",..: 7 14 13 12 11 13 16 16 14 2 ...


In [80]:
as.numeric(f)

In [81]:
as.character(f)

In [82]:
levels(f)

In [83]:
nlevels(f) == length(levels(f))

In [84]:
f[f == "C"] <- "X"
f

In [85]:
(f <- droplevels(f))

In [86]:
(levels(f) <- tolower(levels(f)))

To improve understanding of factors, here is how they are formed:

`levels <- sort(unique(x))`

`f <- match(x, levels)`

`levels(f) <- as.character(levels)`

`class(f) <- "factor"`

As one can see, the levels are the sorted unique values of the variable __x__. Each element of the factor stores the position of its value within the levels.
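The four steps above can be run on a small vector to see each piece. This is only a minimal sketch: the vector `x` and the name `lev` are made up for illustration, and the real `factor()` does more (for example, it handles `exclude` and `labels`).

```r
# Toy input (an assumption for illustration, not data from this notebook)
x <- c("b", "a", "c", "a", "b")

lev <- sort(unique(x))          # sorted unique values
f <- match(x, lev)              # position of each value among the levels
levels(f) <- as.character(lev)  # attach the levels as an attribute
class(f) <- "factor"            # mark the integer vector as a factor

f
as.integer(f)   # the underlying integer codes
levels(f)       # the labels the codes point to
```

Printing `f` shows the original labels, while `as.integer(f)` shows the positions computed by `match()`.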

To merge several levels into one:

`levels(x)[1:2] <- levels(x)[1]`

Here, the first two levels are both replaced by the first one, so every element that belonged to either level now belongs to the same level.
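A short sketch of the merge on a toy factor (the values are made up for illustration):

```r
# Toy factor with three levels: "a", "b", "c"
x <- factor(c("a", "b", "c", "a", "b"))
levels(x)

# Assign the first level's label to the first two levels;
# R collapses the duplicated labels into a single level
levels(x)[1:2] <- levels(x)[1]

levels(x)        # only two levels remain
as.character(x)  # every former "b" is now "a"
```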

In memory, factors are stored as integers: 

`(storage.mode(factor()) == "integer")`
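Integer storage has a practical consequence worth sketching: converting a factor whose labels look like numbers with `as.numeric()` returns the codes, not the labels. The factor `f_num` below is a made-up example.

```r
# Toy factor whose labels happen to be numbers (illustrative assumption)
f_num <- factor(c("10", "20", "10", "30"))

typeof(f_num)                     # "integer": the storage type
as.numeric(f_num)                 # the codes, not the original numbers
as.numeric(as.character(f_num))   # go through the labels to recover the numbers
```

Converting through `as.character()` first is one common way to get the original values back.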

In [87]:
levels(f)[1] <- 'ababa'
f

***ORDERED FACTORS***

- for ordinal variables
- created using the 'ordered' function or the 'ordered = TRUE' argument of the 'factor' function
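As a small sketch of the two creation routes listed above, the toy vector `sizes` (an assumption for illustration) is turned into an ordered factor both ways. The explicit `levels` argument fixes the order; without it, levels would be sorted alphabetically.

```r
# Toy ordinal data (illustrative assumption)
sizes <- c("small", "large", "medium", "small")
size_levels <- c("small", "medium", "large")

s1 <- ordered(sizes, levels = size_levels)
s2 <- factor(sizes, levels = size_levels, ordered = TRUE)

identical(s1, s2)  # both routes give the same object
s1 < "large"       # comparisons follow the level order
max(s1)            # the highest level present
```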

In [88]:
grad <- c("burning cold", "cold", "neutral", "warm", "hot", "burning hot")
ft <- ordered(sample(grad, 14, replace = TRUE), grad)
ft[ft > "warm"]

In [89]:
ft2 <- factor(sample(1:30, 20, replace = FALSE), ordered = TRUE)
ft2

***'cut' function***
- breaks a _numeric_ vector into intervals, thus transforming quantitative data into categorical data (a factor of intervals)
- use 'table' to count the number of elements in each interval
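How the interval ends are treated matters when values fall exactly on a break. The sketch below uses a made-up vector `x` and breaks at 0, 5 and 10 to compare the default (right-closed intervals), `include.lowest = TRUE`, and `right = FALSE`.

```r
# Toy values, some of which sit exactly on the breaks (illustrative assumption)
x <- c(0, 1, 2.5, 5, 7, 10)
brk <- c(0, 5, 10)

default_cut <- cut(x, brk)                        # (0,5] (5,10]: 0 falls outside and becomes NA
lowest_cut  <- cut(x, brk, include.lowest = TRUE) # [0,5] (5,10]: 0 is now included
left_cut    <- cut(x, brk, right = FALSE)         # [0,5) [5,10): 10 falls outside and becomes NA

table(default_cut)
table(lowest_cut)
table(left_cut)
```

`table()` drops the `NA` values by default, so the counts in the first and third tables sum to 5 rather than 6.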

In [90]:
x <- rnorm(10)  # draw once so that all three outputs describe the same sample
round(x, 2); cut(x, -5:5); table(cut(x, -5:5))


(-5,-4] (-4,-3] (-3,-2] (-2,-1]  (-1,0]   (0,1]   (1,2]   (2,3]   (3,4]   (4,5] 
      0       0       0       1       0       6       3       0       0       0 

***Tapply***

- an apply-style function that evaluates a function within each group defined by a factor
- factors are often encountered in data frames, where they represent nominal variables
- counting frequencies is a common task when working with factor variables
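The points above can be sketched on the built-in `warpbreaks` data set. The `split()` plus `sapply()` version is shown only as an illustration of what `tapply` does conceptually; it is not how the notebook computes anything.

```r
# Frequency count of a factor column
table(warpbreaks$wool)

# Mean number of breaks per wool type
by_wool <- tapply(warpbreaks$breaks, warpbreaks$wool, mean)

# Roughly the same computation spelled out: split by group, then summarise
by_split <- sapply(split(warpbreaks$breaks, warpbreaks$wool), mean)

# Grouping by two factors gives a wool x tension table
by_both <- tapply(warpbreaks$breaks,
                  list(warpbreaks$wool, warpbreaks$tension),
                  mean)

by_wool
by_split
by_both
```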

In [91]:
str(warpbreaks)

'data.frame':	54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...


In [92]:
tapply(warpbreaks$breaks, warpbreaks$wool, max)

***Homework***

In [93]:
df <- quakes
str(quakes); summary(quakes$mag)
# Two equivalent ways to bin magnitudes into half-unit intervals
table(cut(quakes$mag, seq(min(quakes$mag), max(quakes$mag) + 0.5, by = 0.5), right = FALSE))
table(cut(quakes$mag, seq(4, by = 0.5, length.out = 6), right = FALSE))

'data.frame':	1000 obs. of  5 variables:
 $ lat     : num  -20.4 -20.6 -26 -18 -20.4 ...
 $ long    : num  182 181 184 182 182 ...
 $ depth   : int  562 650 42 626 649 195 82 194 211 622 ...
 $ mag     : num  4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
 $ stations: int  41 15 43 19 11 12 43 15 35 19 ...


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4.00    4.30    4.60    4.62    4.90    6.40 


[4,4.5) [4.5,5) [5,5.5) [5.5,6) [6,6.5) 
    377     425     160      33       5 


[4,4.5) [4.5,5) [5,5.5) [5.5,6) [6,6.5) 
    377     425     160      33       5 

In [94]:
# Descriptive statistics and analytics with avianHabitat.csv

avian <- read.csv("R_stuff/Stepik R basics/avianHabitat.csv")
str(avian)
head(avian)
summary(avian)

'data.frame':	1070 obs. of  17 variables:
 $ Site    : chr  "BunkerHill27" "BunkerHill27" "BunkerHill27" "BunkerHill27" ...
 $ Observer: chr  "RA" "RA" "RA" "RA" ...
 $ Subpoint: int  1 2 3 4 5 6 7 8 9 10 ...
 $ VOR     : num  6 4.5 2 2.5 4 2 5.5 4 3.5 3.5 ...
 $ PDB     : int  3 2 4 3 4 3 3 2 2 2 ...
 $ DBHt    : num  5.2 3.1 5.5 6.2 5.4 4 5.2 4.4 5.7 4.8 ...
 $ PW      : int  0 3 1 0 0 0 2 1 1 0 ...
 $ WHt     : num  0 4.7 5.8 0 0 0 6.3 4.1 5.7 0 ...
 $ PE      : int  4 3 3 3 3 3 2 2 2 1 ...
 $ EHt     : num  2.9 4.1 3.9 4 3.5 4.1 2.6 4.3 5.2 1.7 ...
 $ PA      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ AHt     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PH      : int  4 3 3 4 4 2 4 5 4 5 ...
 $ HHt     : num  3 3.5 7.5 5 3.7 3.5 5.8 8.2 6.9 5.7 ...
 $ PL      : int  0 2 0 0 0 0 0 0 0 0 ...
 $ LHt     : num  0 1 0 0 0 0 0 0 0 0 ...
 $ PB      : int  0 0 0 0 0 0 0 0 0 0 ...


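| | Site | Observer | Subpoint | VOR | PDB | DBHt | PW | WHt | PE | EHt | PA | AHt | PH | HHt | PL | LHt | PB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` |
| 1 | BunkerHill27 | RA | 1 | 6.0 | 3 | 5.2 | 0 | 0.0 | 4 | 2.9 | 0 | 0 | 4 | 3.0 | 0 | 0 | 0 |
| 2 | BunkerHill27 | RA | 2 | 4.5 | 2 | 3.1 | 3 | 4.7 | 3 | 4.1 | 0 | 0 | 3 | 3.5 | 2 | 1 | 0 |
| 3 | BunkerHill27 | RA | 3 | 2.0 | 4 | 5.5 | 1 | 5.8 | 3 | 3.9 | 0 | 0 | 3 | 7.5 | 0 | 0 | 0 |
| 4 | BunkerHill27 | RA | 4 | 2.5 | 3 | 6.2 | 0 | 0.0 | 3 | 4.0 | 0 | 0 | 4 | 5.0 | 0 | 0 | 0 |
| 5 | BunkerHill27 | RA | 5 | 4.0 | 4 | 5.4 | 0 | 0.0 | 3 | 3.5 | 0 | 0 | 4 | 3.7 | 0 | 0 | 0 |
| 6 | BunkerHill27 | JT | 6 | 2.0 | 3 | 4.0 | 0 | 0.0 | 3 | 4.1 | 0 | 0 | 2 | 3.5 | 0 | 0 | 0 |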


     Site             Observer            Subpoint           VOR        
 Length:1070        Length:1070        Min.   : 1.000   Min.   : 0.000  
 Class :character   Class :character   1st Qu.: 3.000   1st Qu.: 0.000  
 Mode  :character   Mode  :character   Median : 6.000   Median : 1.000  
                                       Mean   : 5.921   Mean   : 1.203  
                                       3rd Qu.: 8.000   3rd Qu.: 1.500  
                                       Max.   :15.000   Max.   :19.000  
      PDB              DBHt               PW             WHt        
 Min.   :0.0000   Min.   : 0.0000   Min.   :0.000   Min.   : 0.000  
 1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.:0.000   1st Qu.: 0.000  
 Median :0.0000   Median : 0.0000   Median :1.000   Median : 0.400  
 Mean   :0.8682   Mean   : 0.7827   Mean   :1.151   Mean   : 1.027  
 3rd Qu.:2.0000   3rd Qu.: 1.2000   3rd Qu.:2.000   3rd Qu.: 1.100  
 Max.   :5.0000   Max.   :10.0000   Max.   :6.000   Max.   :24.500  
      

In [95]:
any(!complete.cases(avian))
any(avian$PDB < 0)
any(avian$PDB > 100)

In [96]:
avian$Observer <- as.factor(avian$Observer)
str(avian)

'data.frame':	1070 obs. of  17 variables:
 $ Site    : chr  "BunkerHill27" "BunkerHill27" "BunkerHill27" "BunkerHill27" ...
 $ Observer: Factor w/ 3 levels "JT","RA","RR": 2 2 2 2 2 1 1 1 1 1 ...
 $ Subpoint: int  1 2 3 4 5 6 7 8 9 10 ...
 $ VOR     : num  6 4.5 2 2.5 4 2 5.5 4 3.5 3.5 ...
 $ PDB     : int  3 2 4 3 4 3 3 2 2 2 ...
 $ DBHt    : num  5.2 3.1 5.5 6.2 5.4 4 5.2 4.4 5.7 4.8 ...
 $ PW      : int  0 3 1 0 0 0 2 1 1 0 ...
 $ WHt     : num  0 4.7 5.8 0 0 0 6.3 4.1 5.7 0 ...
 $ PE      : int  4 3 3 3 3 3 2 2 2 1 ...
 $ EHt     : num  2.9 4.1 3.9 4 3.5 4.1 2.6 4.3 5.2 1.7 ...
 $ PA      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ AHt     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PH      : int  4 3 3 4 4 2 4 5 4 5 ...
 $ HHt     : num  3 3.5 7.5 5 3.7 3.5 5.8 8.2 6.9 5.7 ...
 $ PL      : int  0 2 0 0 0 0 0 0 0 0 ...
 $ LHt     : num  0 1 0 0 0 0 0 0 0 0 ...
 $ PB      : int  0 0 0 0 0 0 0 0 0 0 ...


In [97]:
# Returns TRUE if any value lies outside the 0-100 percent range
check_percent_range <- function(x) {
    any(x < 0 | x > 100)
}
check_percent_range(avian$VOR)
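One caveat with a check of this form: if the column contains `NA` and no value is out of range, `any()` returns `NA` rather than `FALSE`. The variant below, with the hypothetical name `check_percent_range_na`, is a sketch of one way to handle that; it is an illustration, not a function from the course material.

```r
# Hypothetical NA-tolerant variant (name and behaviour are assumptions)
check_percent_range_na <- function(x) {
  # na.rm = TRUE ignores comparisons that evaluate to NA
  any(x < 0 | x > 100, na.rm = TRUE)
}

any(c(NA, 5) < 0 | c(NA, 5) > 100)   # NA: the missing value hides the answer
check_percent_range_na(c(NA, 5))     # FALSE: no known violation
check_percent_range_na(c(NA, 150))   # TRUE: 150 is out of range
```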

In [98]:
# Use a regular expression to select the coverage variables (names starting with "P")
library(stringr)
coverage_variables <- names(avian)[str_detect(names(avian), "^P")]
avian$total_coverage <- rowSums(avian[, coverage_variables])
str(avian$total_coverage)
head(avian, 5)

 num [1:1070] 11 13 11 10 11 8 11 10 9 8 ...


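| | Site | Observer | Subpoint | VOR | PDB | DBHt | PW | WHt | PE | EHt | PA | AHt | PH | HHt | PL | LHt | PB | total_coverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<fct>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` |
| 1 | BunkerHill27 | RA | 1 | 6.0 | 3 | 5.2 | 0 | 0.0 | 4 | 2.9 | 0 | 0 | 4 | 3.0 | 0 | 0 | 0 | 11 |
| 2 | BunkerHill27 | RA | 2 | 4.5 | 2 | 3.1 | 3 | 4.7 | 3 | 4.1 | 0 | 0 | 3 | 3.5 | 2 | 1 | 0 | 13 |
| 3 | BunkerHill27 | RA | 3 | 2.0 | 4 | 5.5 | 1 | 5.8 | 3 | 3.9 | 0 | 0 | 3 | 7.5 | 0 | 0 | 0 | 11 |
| 4 | BunkerHill27 | RA | 4 | 2.5 | 3 | 6.2 | 0 | 0.0 | 3 | 4.0 | 0 | 0 | 4 | 5.0 | 0 | 0 | 0 | 10 |
| 5 | BunkerHill27 | RA | 5 | 4.0 | 4 | 5.4 | 0 | 0.0 | 3 | 3.5 | 0 | 0 | 4 | 3.7 | 0 | 0 | 0 | 11 |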


In [99]:
sapply(coverage_variables, function(name) check_percent_range(avian[[name]]))

__Extracting the site name from the Site column and finding the site with the lowest mean total coverage across all coverage variables__

In [100]:
avian$Site_name <- factor(str_replace(avian$Site, "[:digit:]+", ""))
str(avian$Site_name)
tapply(avian$total_coverage, avian$Site_name, mean)
sort(tapply(avian$total_coverage, avian$Site_name, mean))

 Factor w/ 5 levels "BunkerHill","CreteCreek",..: 1 1 1 1 1 1 1 1 1 1 ...
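The same idea can be sketched without `stringr`, using base R's `sub()` and picking the minimum with `which.min()`. The data frame `toy` and its values are invented for illustration and do not come from `avianHabitat.csv`.

```r
# Toy data (illustrative assumption, not the real avian data)
toy <- data.frame(
  Site = c("BunkerHill27", "BunkerHill30", "CreteCreek5", "CreteCreek12"),
  total_coverage = c(11, 13, 8, 9)
)

# Remove the first run of digits to get the site name
toy$Site_name <- factor(sub("[0-9]+", "", toy$Site))

# Mean total coverage per site, then the name of the site with the lowest mean
site_means <- tapply(toy$total_coverage, toy$Site_name, mean)
site_means
names(which.min(site_means))
```

`which.min()` returns the position of the first minimum, so its name gives the site directly without sorting the whole result.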


#### Homework 2


In [113]:
height_variables <- names(avian)[str_ends(names(avian), "Ht")]

In [115]:
sapply(avian[, height_variables], function(x) tapply(x, avian$Observer, max))

| | DBHt | WHt | EHt | AHt | HHt | LHt |
|---|---|---|---|---|---|---|
| JT | 9.9 | 24.5 | 5.3 | 31.5 | 8.2 | 0.8 |
| RA | 10.0 | 18.5 | 4.9 | 19.2 | 7.5 | 1.3 |
| RR | 5.0 | 22.0 | 4.2 | 0.2 | 7.3 | 1.1 |
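An alternative to nesting `tapply` inside `sapply` is base R's `aggregate()`, which returns a data frame with one row per group. The sketch below runs on an invented data frame `toy` with hypothetical columns `AHt` and `BHt`; it shows the pattern only and is not the homework's required solution.

```r
# Toy data (illustrative assumption, not the real avian data)
toy <- data.frame(
  Observer = factor(c("JT", "JT", "RA", "RA")),
  AHt = c(1.0, 3.5, 2.0, 0.5),
  BHt = c(0.2, 0.1, 0.9, 0.4)
)

# Select the height columns by their "Ht" suffix
height_cols <- grep("Ht$", names(toy), value = TRUE)

# Maximum of every height column within each observer
max_by_observer <- aggregate(toy[height_cols],
                             by = list(Observer = toy$Observer),
                             FUN = max)
max_by_observer
```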
