Skip to content

Excel to R how do I...

JTecwyn edited this page Dec 14, 2021 · 9 revisions

A barrier to transitioning from using Excel to using R is that an experienced Excel user may wonder "how do I do that simple Excel task (e.g. summing a column) in R?"

The purpose of this page is to explain how to do Excel tasks in R. We focus on methods that can be applied to the R data frame, using the dplyr package for data manipulations.

dplyr logic focuses around the "pipe" %>% with shortcut ctrl+shift+m. Essentially everything on the left-hand side of %>% gets "piped" into the next argument. This avoids having long lines of code defining new outputs based on previous outputs.

R comes in with a built-in dataset called "iris". We include examples applied to this dataset in the table below so that you can recreate them in your local area.

REMEMBER: R is case sensitive, so all references to column names/entries need to be as-is in the dataset you are looking at. Functions exist that can translate all your columns to lower or snake case for ease!

How to do common Excel tasks in R:

Common Excel Task How to do in R dataframe (with dplyr) Example with iris dataset
List unique entries in field (column) unique(dfname$fieldname)

or if factors, can do: levels(dfname$fieldname)
Find the unique entries for the "Species" column in iris

iris %>% select(Species) %>% distinct()
Filter/select based on criteria filter(dfname, fieldname == value)

filter(dfname, grepl("search_term",field_name)) #filters based on containing string

filter(dfname, (URN == "141006" | URN == "138262" | URN == "141164")) # example of filtering with OR
Filter for sepal length >4 and sepal width <2.5, but NOT "versicolor" species

iris %>% filter(Sepal.Length > 4 & Sepal.Width <2.5 & Species != "versicolor")
Select specific columns select(colname1, colname2,….) Select only species and petal length

iris %>% select(Species, Petal.Length)
If else with OR e.g. mutate(var_name = ifelse(ks2_type_1718 == "AC" | ks2_type_1718 == "ACC",1,0)) Create new column called "size_group" based on size of petal. "Large" petals are >4 in length or >1.5 in width, everything else is "Small"

iris %>% mutate(size_group = if_else( Petal.Length > 4 | Petal.Width >1.5, "Large", "Small"))
Multiple if else Create new column called "flower_price". "Setosa" species with petal length > 3 are "top band", "Versicolor species with petal length < 4 are "low band", everything else is "mid band"

iris %>% mutate(flower_price = case_when(Species == "setosa" & Petal.Length > 1.5 ~"top band", Species == "versicolor" & Petal.Length < 4 ~"low_band", TRUE ~ "mid_band"))
CountIFs (count of occurences of each entry in column) dfname %>% count(fieldname) Count the number of species if they have a petal length >1.5

iris %>% filter(Petal.Length > 1.5 ) %>% group_by(Species) %>% count()
Sum a column sum(dfname$fieldname)
SUMIF Sum petal width of species if sepal width <3

iris %>% filter(Sepal.Width <3) %>% group_by(Species) %>% summarise(Petal.Width = sum(Petal.Width))
VLOOKUP Lookup to a table called "lookup" which has a list of species

iris %>% left_join(lookup, by =c("Species"="plant_species")
Arrange / sort in order arrange(fieldname)

arrange(match(band,c("IL","OL","F","R")) # specific order for fieldname band
Order dataset by descending petal width

iris %>% arrange(desc(Petal.Width))
Maximum of data values in a given row pmax(fieldname1,fieldname2,fieldname3)
Pivot pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns.

The inverse transformation is pivot_wider().

e.g earnings_wide_alt = earnings_model_outputs %>% pivot_wider(x_master_ano_id, names_from = "YEARS_SINCE_SRDD", values_from = "UK_EARNINGS") #
Rename columns rename(new_field_name = "current_field_name") # current field name must be in quote marks
Convert data to numeric format e.g. transform(fieldname = as.numeric(fieldname))
Bind rows from different dataframes bind_rows, rbind
e.g. df1 <- df1 %>% bind_rows(df2)
note that fields must have same names in order to "bind"
Concatenate / Split e.g.
unite(LAcom_URN, LA_com, URN, sep = "_", remove = FALSE)

separate(LAcom_URN, into = c('LA_com','URN'), sep="_", remove = FALSE)
group_by with summarise group_by(domicile, gender) %>% summarise(count = sum(count))

Other useful tasks / functions in R:

Task How to do in R dataframe (with dplyr)
Find dimensions (size) of dataset dim(dataset)
list types for each attribute sapply(dataset, class)

str(df_name)
Replace NA with zero replace(., is.na(.), 0)
Drop all rows with NA in particular column drop_na(fieldname)
Merge/join (note: you can use left_join / right_join etc instead of merge) Inner join: merge(x = df1, y = df2, by = "CustomerId") #Return only the rows in which the left table have matching keys in the right table.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE) #Returns all rows from both tables, join records from the left which have matching keys in the right table.

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE) #Return all rows from the left table, and any rows with matching keys from the right table.

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE) #Return all rows from the right table, and any rows with matching keys from the left table.

Cross join: merge(x = df1, y = df2, by = NULL)
Read from/Write to CSV e.g.read_csv("school_estate/data/CMM.csv"), write_csv(df_name, "school_estate/data/CMM.csv")
Read from/Write to RDs e.g.borrower_status_table=readRDS("W:/HEA-SLR-TEAM/Earnings model development/Datasets for models/2021-08-05_borrower_status_table.RDs"), saveRDS(borrower_status_table,paste0(historic_earnings_outputs_filepath,"/",date,"_borrower_status_table.Rds"))

Useful other links / resources to consider:

Presentation by Jon Tecwyn on some of the most common dplyr "verbs": dplyr_basic_verbs.pptx

data.table and dplyr tour: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

A short list of basic, commonly-used Excel formulas and their (base) R counterparts. https://www.rforexcelusers.com/excel-r-function-formula-list/

David Sands' Coffee & Coding on SQL/Excel to R: https://github.com/dfe-analytical-services/coffee-and-coding/tree/master/20181121_sql-and-excel-to-r