Skip to content

Latest commit

 

History

History
223 lines (175 loc) · 6.28 KB

tidyr.md

File metadata and controls

223 lines (175 loc) · 6.28 KB
layout element title language
page
notes
tidyr
SQL

Remember the basic rules of tidy data structure

  1. One column per type of information
  2. One row per observation
  3. One value in each cell
  • Unfortunately lots of existing data doesn't follow these rules
  • Need to convert them to this tidy structure for analysis
  • Use a package called tidyr
install.packages("tidyr")
library(tidyr)
library(dplyr)

Pivot data from wide to long

  • One common issue is data spread over multiple columns that should be in one

Copy link to Western Ghats tree data from datasets page

raw_data = read.csv("http://datacarpentry.org/semester-biology/data/Macroplot_data_Rev.txt", sep = "\t")

View data

  • Data on tree girth from the Western Ghats
  • When a tree had multiple stems the diameter of each stem was entered in a separate column
  • What would a better structure be?

Lead discussion to correct structure

  • To convert the raw data into clean data we'll use a data cleaning pipeline like the data manipulation pipelines we've developed previously
  • Let's start by adding a treeid column to our data frame using the mutate function from dplyr
  • We want one treeid for each row because there is one tree for each row
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data))
  • To get the data in this form we can use pivot_longer
    • Removes redundant columns
    • Arguments:
      • data.frame
      • Columns to include (or not include)
      • names_to: the name of the new column to put the column names in
      • values_to: the name of the new column to put the column values in
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data)) %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth")
  • The colon specifies all columns starting at TreeGirth1 and ending at TreeGirth5

View data

  • Still has zeros for where there were no stems, so filter these out
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data)) %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
  filter(girth != 0)

Extract

  • Want stem column to contain numbers 1-5 not TreeGirth1
  • extract()
    • Extracts one or more values from a column
    • Uses regular expressions
    • Arguments:
      • data.frame
      • Column name
      • Names of the new columns
      • Regular expression
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data)) %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
  filter(girth != 0) %>%
  extract(stem, 'stem', 'TreeGirth(.)')
  • TreeGirth. means the word "TreeGirth" followed by a single value

  • The () indicate what part of this to extract, so just the number at the end

  • This gives us the result we want, with just the stem number in the stem column

  • But you may notice that this number is on the left side of the column, not the right

  • That's because the number is still stored as a character, because it was extracted from a string

  • To convert it to it's actual type we can add the optional argument convert = TRUE to `extract

clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data)) %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
  filter(girth != 0) %>%
  extract(stem, 'stem', 'TreeGirth(.)', convert = TRUE)
  • This attempts to convert the values from characters to their actual type
  • This is a good addition when extracting numbers because then you can work with the column as numbers

Separate

  • Genus and species information are combined in a single column
  • separate()
    • Separates multiple values in single column
    • Arguments:
      • data.frame
      • Column name
      • New column names
      • Separator value, character, or position
clean_data <- raw_data %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
  filter(girth != 0) %>%
  extract(stem, 'stem', 'TreeGirth(.)') %>%
  separate(SpCode, c('genus', 'species'), 4)

Unite and Pivot Wider

  • Sometimes we need to go in the other direction
  • Count the number of stems of each species on each plot
stem_counts <- clean_data %>% 
  group_by(PlotID, genus, species) %>% 
  summarize(count = n())
  • Software for running analysis requires cross-tab (or wide) data
  • Site in rows, species in columns, counts in cells
  • First need a single species ID
  • unite
    • Combine values from multiple columns into one
    • Arguments:
      • data.frame
      • New column name
      • Columns to combine
stem_counts_wide <- stem_counts %>% 
  unite(species_id, genus, species)
  • Then make the data wide
  • pivot_wider
    • Spread values across multiple columns
    • Arguments:
      • data.frame
      • Name of column to use for wide columns
      • Name of column containing the values for the cells
      • Optional fill argument for what to put in empty cells
stem_counts_wide <- stem_counts %>% 
  unite(species_id, genus, species) %>%
  pivot_wider(names_from = species_id, values_from = count)
  • This leaves null values when there is no value in the starting table
  • But we can replace this with something else using values_fill
stem_counts_wide <- stem_counts %>% 
  unite(species_id, genus, species) %>%
  pivot_wider(names_from = species_id,
              values_from = count,
              values_fill = list(count = 0))

Completing data with gaps

  • Some write out a value once and then leave the following rows blank
gappy_data <- read.csv("http://www.datacarpentry.org/semester-biology/data/gappy-data.csv")
gappy_data
  • This works well for humans, but not for computers
  • Can fill in these gaps using fill
clean_data <- gappy_data %>%
  fill(Species)
  • Fills down by default, but other directions are possible

  • Often data only includes observed values, but we need to list other values

  • Missing zeros or NA's

clean_data <- gappy_data %>%
  fill(Species) %>%
  complete(Species, Individual)
  • Could also use this to add zeros to our long stem_counts data frame
stem_counts <- clean_data %>% 
  group_by(PlotID, genus, species) %>% 
  summarize(count = n()) %>%
  complete(PlotID, nesting(genus, species), fill = list(count = 0))