# Factors and Tables

## Factor Definition

• Qualitative data that can take only a finite number of distinct values (i.e. categorical data) can be represented as a factor in R.

• Examples: Democrat, Republican, or Independent; Male or Female; Control or Treatment.

• In R, think of a factor as a vector that carries additional information: a record of its distinct values, called the levels.

• Many R functions, such as summary() and table(), automatically treat factors specially.

In [1]:
data <- rep(c("Control","Treatment"),c(3,4))
data # A character vector

In [2]:
group <- factor(data)
group

In [3]:
str(group)

 Factor w/ 2 levels "Control","Treatment": 1 1 1 2 2 2 2


In [4]:
mode(group)

In [5]:
summary(group)
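
The levels and the integer codes behind a factor can be inspected directly. The following is a minimal sketch that rebuilds the `group` factor from the cells above and uses only base R functions:

```r
# Recreate the factor from the cells above
data <- rep(c("Control", "Treatment"), c(3, 4))
group <- factor(data)

levels(group)      # the distinct values: "Control" "Treatment"
nlevels(group)     # the number of levels: 2
as.integer(group)  # the underlying integer codes: 1 1 1 2 2 2 2
class(group)       # "factor"
```

The integer codes explain why mode(group) reports "numeric" even though the factor was built from a character vector.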

## Functions on Factors
The split() function takes as input a vector and a factor (or list of factors), splitting the input according to the groups of the factor. The output is a list.

Example

Suppose that we know the age and sex of each member of the Control and Treatment groups:

In [6]:
group

In [7]:
ages <- c(20, 30, 40, 35, 35, 35, 35)
sex <- c("M", "M", "F", "M", "F", "F", "F")

In [8]:
split(ages, list(group, sex))
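
Because split() returns a list, a common follow-up is to apply a function to each piece. The sketch below splits by a single factor and computes a per-group mean with sapply(); the variable names `ages_by_group` and `mean_ages` are chosen here for illustration:

```r
group <- factor(rep(c("Control", "Treatment"), c(3, 4)))
ages <- c(20, 30, 40, 35, 35, 35, 35)

# One list element per level of the factor
ages_by_group <- split(ages, group)
ages_by_group$Control   # 20 30 40

# Apply mean() to each element of the list
mean_ages <- sapply(ages_by_group, mean)
mean_ages               # Control 30, Treatment 35
```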

## Tables
The table() function can be used to produce contingency tables in R.


In [9]:
group

In [10]:
table(group)

group
  Control Treatment 
        3         4 

In [11]:
table(sex, group)

   group
sex Control Treatment
  F       1         3
  M       2         1

In [12]:
new_table <- table(sex, group)
new_table[, "Control"]

In [13]:
new_table['M',]

In [14]:
length(group)

In [15]:
# Gives proportions
new_table/length(group)

   group
sex   Control Treatment
  F 0.1428571 0.4285714
  M 0.2857143 0.1428571

In [16]:
# Another way to generate proportions
new_table1 <- prop.table(new_table)
new_table1

   group
sex   Control Treatment
  F 0.1428571 0.4285714
  M 0.2857143 0.1428571
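
By default, prop.table() divides each cell by the grand total. As a small sketch (not part of the original cells), its `margin` argument can be used to get proportions within each row or within each column instead:

```r
group <- factor(rep(c("Control", "Treatment"), c(3, 4)))
sex <- c("M", "M", "F", "M", "F", "F", "F")
tab <- table(sex, group)

# margin = 1: proportions within each row (each row sums to 1)
row_props <- prop.table(tab, margin = 1)

# margin = 2: proportions within each column (each column sums to 1)
col_props <- prop.table(tab, margin = 2)

rowSums(row_props)
colSums(col_props)
```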

# Dataframes

• Use for two-dimensional tables of data.

• Like matrices (rows and columns structure) but each column can have a different mode (character, logical, numeric, ...).

• Use for data that can be represented as observations or cases (rows) on variables (columns).

• Can have row and column names.

• Use data.frame() to create dataframes in R.

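• stringsAsFactors = TRUE converts character vectors into factor variables. This was the default before R 4.0.0; since R 4.0.0 the default is FALSE.

• On older versions of R, it is usually best to set stringsAsFactors = FALSE and create factors manually.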
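
Setting factors manually can look like the sketch below. The ordering of grades in `grade_levels` is a hypothetical choice made for illustration; it is not specified in the original material:

```r
students <- data.frame(
  Name  = c("John", "Jill", "Jacob", "Jenny"),
  Grade = c("B", "A+", "B-", "A"),
  stringsAsFactors = FALSE   # keep character columns as characters
)

# Hypothetical ordering of grades, lowest to highest (an assumption)
grade_levels <- c("B-", "B", "A", "A+")

# Convert only the column that should be categorical
students$Grade <- factor(students$Grade, levels = grade_levels, ordered = TRUE)

str(students)
at_least_A <- students$Grade >= "A"   # comparisons work on ordered factors
at_least_A
```

Creating the factor explicitly keeps control over which columns are categorical and over the order of their levels.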

In [17]:
Name <- c("John", "Jill", "Jacob", "Jenny")
Year <- c(1,1,2,4)
Grade <- c("B", "A+", "B-", "A")
student_data <- data.frame(Name, Year, Grade, stringsAsFactors = FALSE)

student_data

   Name Year Grade
1  John    1     B
2  Jill    1    A+
3 Jacob    2    B-
4 Jenny    4     A


In [18]:
str(student_data)

'data.frame':	4 obs. of  3 variables:
 $ Name : chr  "John" "Jill" "Jacob" "Jenny"
 $ Year : num  1 1 2 4
 $ Grade: chr  "B" "A+" "B-" "A"


In [19]:
summary(student_data)

     Name                Year        Grade          
 Length:4           Min.   :1.0   Length:4          
 Class :character   1st Qu.:1.0   Class :character  
 Mode  :character   Median :1.5   Mode  :character  
                    Mean   :2.0                     
                    3rd Qu.:2.5                     
                    Max.   :4.0                     

In [20]:
library(datasets)
states <- data.frame(state.x77, Region = state.region, Abbr = state.abb)
head(states, 2)

        Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area Region Abbr
Alabama       3615   3624        2.1    69.05   15.1    41.3    20  50708  South   AL
Alaska         365   6315        1.5    69.31   11.3    66.7   152 566432   West   AK


## Accessing Dataframes

In [21]:
student_data

   Name Year Grade
1  John    1     B
2  Jill    1    A+
3 Jacob    2    B-
4 Jenny    4     A


In [22]:
student_data[3:4,]

   Name Year Grade
3 Jacob    2    B-
4 Jenny    4     A


In [23]:
student_data$Grade
print(typeof(student_data$Grade))

[1] "character"


In [24]:
# A column can also be selected with single brackets and its name, much as in Python.
# However, this returns a one-column data frame (whose typeof is "list"), not a vector.

student_data['Grade']
print(typeof(student_data['Grade']))

  Grade
1     B
2    A+
3    B-
4     A


[1] "list"
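
The difference between the access styles above can be summarised in a short sketch. It rebuilds `student_data` so that it runs on its own and uses only base R indexing:

```r
student_data <- data.frame(
  Name  = c("John", "Jill", "Jacob", "Jenny"),
  Year  = c(1, 1, 2, 4),
  Grade = c("B", "A+", "B-", "A"),
  stringsAsFactors = FALSE
)

class(student_data$Grade)        # "character": a plain vector
class(student_data[["Grade"]])   # "character": same as $
class(student_data["Grade"])     # "data.frame": a one-column data frame
class(student_data[, "Grade"])   # "character": dropped to a vector by default
class(student_data[, "Grade", drop = FALSE])  # "data.frame"
```

Use `$` or `[[ ]]` when a vector is needed, and single brackets when the result should stay a data frame.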


In [25]:
# Rows can also be selected by row name
states["New York", ]

         Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area    Region Abbr
New York      18076   4903        1.4    70.55   10.9    52.7    82 47831 Northeast   NY


## Filtering Dataframes

In [26]:
student_data

   Name Year Grade
1  John    1     B
2  Jill    1    A+
3 Jacob    2    B-
4 Jenny    4     A


In [27]:
student_data[student_data$Grade == "A+", ]

  Name Year Grade
2 Jill    1    A+


In [28]:
# Using the which() function can also achieve the same goal.
student_data[which(student_data$Grade == 'A+'),]

  Name Year Grade
2 Jill    1    A+


In [29]:
student_data[student_data$Year <= 2, ]

   Name Year Grade
1  John    1     B
2  Jill    1    A+
3 Jacob    2    B-


In [30]:
states[states$Region == "Northeast", "Population"]
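
Filters can be combined with `&` and `|`, and logical indexing and which() behave differently when the condition contains NA. The sketch below illustrates both points; the small data frame `x` is a made-up example, not part of the original data:

```r
student_data <- data.frame(
  Name  = c("John", "Jill", "Jacob", "Jenny"),
  Year  = c(1, 1, 2, 4),
  Grade = c("B", "A+", "B-", "A"),
  stringsAsFactors = FALSE
)

# Combine conditions with & (and) or | (or)
early_good <- student_data[student_data$Year <= 2 & student_data$Grade != "B-", ]
early_good$Name   # "John" "Jill"

# subset() expresses the same filter without repeating the data frame name
same_rows <- subset(student_data, Year <= 2 & Grade != "B-")

# Made-up data with a missing value
x <- data.frame(v = c(1, NA, 3))
n_logical <- nrow(x[x$v > 1, , drop = FALSE])         # 2: the NA yields an all-NA row
n_which   <- nrow(x[which(x$v > 1), , drop = FALSE])  # 1: which() drops the NA
```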

## Adding Rows and Columns to Dataframes

In [31]:
new_stu <- data.frame(Name="Bobby", Year=3, Grade="A")
student_data <- rbind(student_data, new_stu)
student_data

   Name Year Grade
1  John    1     B
2  Jill    1    A+
3 Jacob    2    B-
4 Jenny    4     A
5 Bobby    3     A


In [32]:
student_data$School <- "Columbia"
student_data

   Name Year Grade   School
1  John    1     B Columbia
2  Jill    1    A+ Columbia
3 Jacob    2    B- Columbia
4 Jenny    4     A Columbia
5 Bobby    3     A Columbia


In [33]:
# Columns can also be added with single-bracket assignment, much as in Python.
student_data['Program'] <- 'Statistics'
student_data

   Name Year Grade   School    Program
1  John    1     B Columbia Statistics
2  Jill    1    A+ Columbia Statistics
3 Jacob    2    B- Columbia Statistics
4 Jenny    4     A Columbia Statistics
5 Bobby    3     A Columbia Statistics
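
Two details of the steps above are worth a small sketch: rbind() on data frames matches columns by name rather than by position, and a new column can be computed from existing ones. The `Senior` column is a hypothetical example introduced here for illustration:

```r
student_data <- data.frame(
  Name  = c("John", "Jill"),
  Year  = c(1, 1),
  Grade = c("B", "A+"),
  stringsAsFactors = FALSE
)

# The new row lists its columns in a different order
new_stu <- data.frame(Grade = "A", Name = "Bobby", Year = 3,
                      stringsAsFactors = FALSE)

# rbind() aligns the columns by name, so the order does not matter
combined <- rbind(student_data, new_stu)

# A derived column computed from an existing one (hypothetical rule)
combined$Senior <- combined$Year >= 3
combined
```

A single value assigned to a column, as with "Columbia" above, is recycled to every row, while a vector must have one entry per row.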
