# R | Factors

In R, factors are a data type used to represent categorical variables. They play a central role in statistical analysis and data modeling, because they let R handle data with discrete categories, or levels, efficiently. Factors are created using the `factor()` function, which takes a vector of data and, optionally, its distinct categories (the `levels` argument). Each unique value in the vector becomes a level; if you do not supply levels, R uses the sorted unique values. The factor then retains the order and labels of these levels.

Factors are beneficial in various aspects of data analysis, such as:

1.  **Statistical Modeling:** Factors are essential in statistical models, as they enable the representation of qualitative or nominal variables. They are commonly used in regression, ANOVA, and other analyses.
    
2.  **Memory Efficiency:** Factors store each level label once and represent the data as integer codes pointing to those labels. This can reduce memory usage compared with character vectors on large datasets, although the saving is often modest because R also keeps only one copy of each distinct string internally.
    
3.  **Ordered Categorical Data:** Factors can handle ordered categories, making them suitable for data with predefined rankings or levels.
    
4.  **Visualization:** Factors work seamlessly with plotting functions, allowing easy visualization of categorical data.
    
5.  **Consistent Subsetting and Sorting:** Factors maintain the order of levels, ensuring consistent subsetting and sorting of data based on categories.
    
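At the end of the [last lab](./Lab-03-dataframes.ipynb) on data frames, we learned that character data loaded into a data frame can be converted into a special data structure called a factor. This conversion was the default in R versions before 4.0.0. From R 4.0.0 onward, `stringsAsFactors` defaults to `FALSE`, so character columns stay as character unless you request the conversion.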
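
As a small sketch of controlling this conversion explicitly, the `stringsAsFactors` argument of `data.frame()` can be set by hand so the result does not depend on the default of the installed R version. The data frame contents and column names below are made up for illustration.

```r
# Hypothetical toy data; names and values are illustrative only
df_chr <- data.frame(name  = c("Ann", "Bob", "Cy"),
                     group = c("a", "b", "a"),
                     stringsAsFactors = FALSE)  # keep text as character

df_fct <- data.frame(name  = c("Ann", "Bob", "Cy"),
                     group = c("a", "b", "a"),
                     stringsAsFactors = TRUE)   # convert text to factors

class(df_chr$group)    # "character"
class(df_fct$group)    # "factor"
levels(df_fct$group)   # "a" "b"
```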

While factors are incredibly useful for categorical data, it is essential to be cautious when performing mathematical operations with them, as they are internally stored as integers. Converting factors to character vectors or vice versa should be done carefully to avoid unintended consequences. Overall, mastering the usage of factors in R is fundamental for efficient data handling and accurate statistical analyses.
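
To make the integer storage concrete, here is a minimal sketch that inspects a small factor. The `size` vector is a made-up example, not data from this lab.

```r
# Made-up example factor
size <- factor(c("small", "large", "small", "medium"))

typeof(size)       # "integer": the underlying storage type
levels(size)       # "large" "medium" "small" (sorted by default)
as.integer(size)   # 3 1 3 2: each code is a position in levels(size)

# Arithmetic on a factor is not meaningful: R warns and returns NA
total <- suppressWarnings(size + 1)
total              # NA NA NA NA
```

The codes carry no numeric meaning of their own, which is why they should not be treated as measurements.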



In [1]:
gender_vector <- c(rep("male",10),
                   rep("female",15))   # Create a character variable

gender_factor <- factor(gender_vector) # Convert to factor
 
print(gender_factor)

 [1] male   male   male   male   male   male   male   male   male   male  
[11] female female female female female female female female female female
[21] female female female female female
Levels: female male


You can specify the levels a factor can take by passing a character vector of levels to the levels argument.

In [2]:
gender_factor <- factor(gender_vector, 
                        levels = c("male","female","other"))

print(gender_factor)

 [1] male   male   male   male   male   male   male   male   male   male  
[11] female female female female female female female female female female
[21] female female female female female
Levels: male female other


In this case there are no data points that take on the level "other" but the factor allows for the possibility of encountering the category "other".
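
One visible consequence, sketched below, is that `table()` reports a count of zero for the unused level instead of leaving it out. The block rebuilds the same factor so that it can be run on its own.

```r
gender_vector <- c(rep("male", 10), rep("female", 15))
gender_factor <- factor(gender_vector,
                        levels = c("male", "female", "other"))

table(gender_factor)   # male 10, female 15, other 0
```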

You can check, rename and add to the levels of a factor with the levels() function:

In [3]:
levels(gender_factor)    # Check levels

In [4]:
levels(gender_factor) <- c("male","female","unknown")  # Change levels

levels(gender_factor)

In [5]:
levels(gender_factor) <- c("male","female",
                           "unknown","no_response") # Add a level

levels(gender_factor)

You can remove factor levels with no data present by recreating the factor with the factor() function or by using the droplevels() function.

In [6]:
gender_factor <- droplevels(gender_factor) # drop unused levels

levels(gender_factor)
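
The other route, recreating the factor with `factor()`, can be sketched on a small made-up factor `f` that has one unused level:

```r
# Made-up factor with an unused level "c"
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
levels(f)               # "a" "b" "c"

f_rebuilt <- factor(f)  # calling factor() again keeps only the levels in use
levels(f_rebuilt)       # "a" "b"

levels(droplevels(f))   # "a" "b"
```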

In R, there exists another type of factor known as an "ordered factor," specifically designed for handling ordinal data. Ordinal data refers to non-numeric information that possesses a natural ordering. For instance, a variable with levels like "very low," "low," "medium," "high," and "very high" is non-numeric but exhibits a clear order. To represent such data accurately, it can be encoded as an ordered factor.

To create an ordered factor in R, you can utilize the `factor()` function with the additional argument `ordered=TRUE`, or alternatively, employ the `ordered()` function.

_Important Note: When creating an ordered factor, it is crucial to specify the `levels` argument. The levels provided are used to define the ordering from the lowest to the highest category._

In [7]:
dat <- rep(c("very low", "low", "medium", "high", "very high"), 5)

dat_factor <- factor(dat, 
                     levels=c("very low", "low", "medium", "high", "very high"),
                     ordered=TRUE)

print(dat_factor)

 [1] very low  low       medium    high      very high very low  low      
 [8] medium    high      very high very low  low       medium    high     
[15] very high very low  low       medium    high      very high very low 
[22] low       medium    high      very high
Levels: very low < low < medium < high < very high
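
The following sketch shows the `ordered()` alternative and what the ordering makes possible, namely comparisons and `min()`/`max()`. The `ratings` vector is invented for illustration.

```r
# Made-up ordinal data
ratings       <- c("low", "high", "medium", "low")
rating_levels <- c("low", "medium", "high")   # lowest to highest

r1 <- factor(ratings, levels = rating_levels, ordered = TRUE)
r2 <- ordered(ratings, levels = rating_levels)
identical(r1, r2)   # TRUE: both calls build the same ordered factor

r1 > "low"          # FALSE TRUE TRUE FALSE
max(r1)             # high
```

These comparisons are only defined for ordered factors. On an unordered factor, `>` returns `NA` with a warning.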


While factors can be useful for storing categorical data during analysis, they are usually not what you want if you need to clean, alter or otherwise process the data prior to analysis. Convert a factor to character using as.character().

In [8]:
as.character(gender_factor)

If you try to convert a factor to numeric, the result will be a numeric vector corresponding to the integers assigned to each factor level.

In [9]:
as.numeric(gender_factor)

If for some reason you have numeric data encoded as a factor, this might not be the desired behavior. To convert a factor with numeric levels to a numeric vector of the level values use the following construction.

In [10]:
numeric_factor <- factor(c(-1.3, -2.6, 2.6, 3.2, 2.6, 4.5, -1.3))

# This construction lets you extract the numbers
as.numeric(levels(numeric_factor))[numeric_factor]

# Converting to character first also works (but may run slower)
as.numeric(as.character(numeric_factor))

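If you'd like to add more values to an existing factor, you can't just use c() like you would when combining normal vectors: combining a factor with plain character values using c() does not give you a factor back. (Since R 4.1.0, c() does combine two factors into a single factor, but earlier versions returned their integer codes.) One way to add to a factor that works in any version is to convert the factor to character, concatenate it with the new values, and then convert it back to factor.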

In [11]:
# This adds more values to the gender_factor

gender_factor <- as.factor(c(as.character(gender_factor), 
                "Unknown", "Unknown", "Prefer not to say"))

summary(gender_factor)

# Factor Indexing

Factor indexing is a data manipulation technique commonly used in statistical programming languages like R. A factor is a categorical data type that represents discrete values with a predefined set of levels or categories. Indexing a factor allows accessing specific subsets of data based on the categories or levels assigned to each element in the factor.

For instance, if we have a factor variable called "weather" with levels "sunny," "cloudy," and "rainy," factor indexing enables us to extract all the data corresponding to a particular weather condition efficiently. It provides a convenient way to filter, subset, and analyze data based on distinct categories, making it a powerful tool for data exploration and analysis tasks involving categorical variables.
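
The weather example can be sketched as follows. The `weather` factor and the `temperature` values are invented for illustration.

```r
# Made-up observations
weather <- factor(c("sunny", "rainy", "cloudy", "rainy", "sunny"),
                  levels = c("sunny", "cloudy", "rainy"))
temperature <- c(24, 15, 18, 14, 26)

temperature[weather == "rainy"]   # 15 14: temperatures on rainy days
which(weather == "sunny")         # 1 5: positions of the sunny days

not_cloudy <- weather[weather != "cloudy"]
levels(not_cloudy)                # still "sunny" "cloudy" "rainy"
```

Note that subsetting keeps the full set of levels, even those that no longer appear in the result.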

Since factors are essentially vectors with each value being an integer paired with a character that specifies the name of the level, factor indexing works the same as vector indexing.

In [12]:
gender_factor[2]                      # Get the second element
gender_factor[9:15]                   # Get a slice of elements
gender_factor[c(3,6,12)]              # Get a selection of specific elements
gender_factor[gender_factor=="male"]  # Get all values where the level equals male

# Factor Summary Functions

Factor summary functions are statistical tools used to analyze and summarize categorical data represented as factors in R programming. In R, a factor is a data type used to store categorical data, such as group names, gender, or educational levels. Factor summary functions enable users to compute various statistics and insights specific to each category or level within the factor.

These functions let you count the occurrences of each category (`table()`, `summary()`), compute proportions or percentages (`prop.table()`), obtain summary statistics such as the mean, median, and standard deviation of another variable for each group (for example with `tapply()`), and build cross-tabulations to observe relationships between multiple factors (`table()` with several factors).

Factor summary functions are valuable for gaining a deeper understanding of the distribution and characteristics of categorical data, making them essential tools in data exploration and analysis tasks in R.
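
A brief sketch of these operations, using base R functions on invented data (`group`, `region`, and `score` are hypothetical names):

```r
# Made-up data: two factors and one numeric variable
group  <- factor(c("a", "b", "a", "b", "a"))
region <- factor(c("north", "south", "south", "north", "north"))
score  <- c(10, 20, 30, 40, 80)

counts <- table(group)                 # a: 3, b: 2
props  <- prop.table(counts)           # a: 0.6, b: 0.4
means  <- tapply(score, group, mean)   # a: 40, b: 30
cross  <- table(group, region)         # counts for each group/region pair

cross["a", "north"]                    # 2
```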

In addition to levels(), factors support several other summary functions.

In [14]:
summary(gender_factor)  # summary() returns counts for each level

In [15]:
str(gender_factor)     # str() shows the factor's structure

 Factor w/ 4 levels "Prefer not to say",..: 4 4 4 4 4 4 4 4 4 4 ...


In [16]:
length(gender_factor)  # Get the length of the factor

In [17]:
table(gender_factor)   # table() creates a data table of counts

gender_factor
Prefer not to say           Unknown            female              male 
                1                 2                15                10 

# Wrap Up

Factors and ordered factors are valuable in R because numerous statistical, predictive modeling, and graphing functions are designed to treat them as categorical variables. However, manipulating factors can be challenging, which is why it is recommended to work with regular atomic data, such as character vectors, during data cleaning and related tasks.

When loading well-structured data into an R data frame, converting characters to factors can be advantageous. On the other hand, if you are dealing with messy data or data with an unknown structure, it might be better to keep text data in character format and then convert specific columns to factors at a later stage if needed.
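
As an illustration of that workflow, the sketch below cleans text while it is still a character vector and converts it to a factor only at the end. The `raw` values are made up to mimic inconsistent capitalization and stray spaces.

```r
# Made-up messy input
raw <- c(" Male", "female ", "FEMALE", "male", "Female")

nlevels(factor(raw))            # 5: every spelling becomes its own level

clean  <- tolower(trimws(raw))  # trim whitespace, then lower-case
gender <- factor(clean, levels = c("male", "female"))
table(gender)                   # male 2, female 3
```

Converting first and cleaning afterwards would mean renaming and merging levels by hand, which is more error-prone than fixing the strings up front.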

## Exercises

To do the exercises, fill in and run the code boxes according to the exercise instructions.

### Exercise #1
Convert the saved vector below into a factor called "language_factor."

In [18]:
language <- c(rep("python",15),rep("R",10),rep("SQL",5))

language_factor <- "Your Code Here!"
language_factor

### Exercise #2
Generate a summary of language_factor.

In [19]:
"Your Code Here!"

### Exercise #3
Add 3 instances of "Julia" to the language_factor and then create another summary. (Remember: to add to a factor, you can convert it to character, use c(), and then convert it back to a factor.)

In [20]:
"Your Code Here!"

## Exercise Solutions

In [21]:
# 1 

language <- c(rep("python",15),rep("R",10),rep("SQL",5))

language_factor <- factor(language)
language_factor

# 2 

summary(language_factor)

# 3

language_factor <- factor( c(as.character(language_factor), rep("Julia",3)) )

summary(language_factor)