# Data Manipulation & Reshaping
This training module was developed by Dr. Kyle R. Roell, Dr. Julia E. Rager, and Alexis Payton.

Spring 2023

## Introduction to Training Module

Data within the fields of exposure science, toxicology, and public health are very rarely ready for statistical analysis or visualization as received. Almost any scripted analysis therefore begins with formatting steps: organizing and manipulating the data so that they are easier to read and work with. This can be done several ways, including:

+ Base R operations and functions, or
+ A collection of packages (and philosophy) known as [Tidyverse](https://www.tidyverse.org).

In this training tutorial we will go over some of the most common ways you can organize and manipulate data, including:

+ Merging data
+ Filtering and subsetting data
+ Melting and casting data

These approaches will first be taught using the basic operations and functions available in base R. Then, the exact same approaches will be taught using the Tidyverse packages and their associated functions and syntax.

These data manipulation and organization methods are demonstrated using an example environmentally relevant human cohort dataset. This cohort was generated by randomly drawing from data distributions in our previously published cohorts, resulting in a bespoke dataset for these training purposes. It contains demographic data and environmental exposure metrics, namely metal levels measured in drinking water sources and in human urine samples.



#### Set your working directory
In preparation, first let's set our working directory to the folder path that contains our input files:

In [None]:
setwd("/filepath to where your input files are")

Note that macOS file paths use "/" as the folder separator, whereas Windows file paths use "\". Within R on Windows, write paths with "/" or with an escaped "\\".
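
One way to sidestep separator differences is to let R assemble the path. The sketch below uses the base R `file.path` function; the folder and file names are the ones used later in this module, and nothing is read from disk here. The variable names `demo_path` and `win_path` are illustrative only.

```r
# Build a path from its pieces; file.path() joins them with "/",
# which R accepts on macOS, Linux, and Windows
demo_path <- file.path("Module2_3", "Module2_3_DemographicData.csv")
demo_path

# A Windows-style path typed by hand needs each backslash escaped
win_path <- "Module2_3\\Module2_3_DemographicData.csv"

# Swapping the backslash for a forward slash shows the two paths are equivalent
gsub("\\\\", "/", win_path) == demo_path
```

Building paths this way keeps a script portable when it is shared between macOS and Windows users.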


#### Importing example datasets

Then let's read in our example datasets

In [2]:
demographic_data <- read.csv("Module2_3/Module2_3_DemographicData.csv")
chemical_data <- read.csv("Module2_3/Module2_3_ChemicalData.csv")

#### Viewing example datasets
Let's see what these datasets look like:

In [3]:
dim(demographic_data)
dim(chemical_data)

The demographic dataset includes 200 rows x 6 columns, while the chemical measurement dataset includes 200 rows x 7 columns.

In [4]:
head(demographic_data)

,ID,BMI,MAge,MEdu,BW,GA
,<int>,<dbl>,<dbl>,<int>,<dbl>,<int>
1,1,27.7,22.99928,3,3180.058,34
2,2,26.8,30.05142,3,3210.823,43
3,3,33.2,28.0466,3,3311.551,40
4,4,30.1,34.81796,3,3266.844,32
5,5,37.4,42.6844,3,3664.088,35
6,6,33.3,24.9496,3,3328.988,40


These demographic data are organized according to subject ID (first column) followed by the following subject information:

+ **BMI** (Body Mass Index)
+ **MAge** (Maternal Age, units: years)
+ **MEdu** (Maternal Education, 1 = "less than high school"; 2 = "high school or some college"; 3 = "college or greater")
+ **BW** (Birth Weight, units: grams)
+ **GA** (Gestational Age, units: weeks)

In [5]:
# The `head` function displays all the columns and the first 6 rows of a dataframe
head(chemical_data)

,ID,DWAs,DWCd,DWCr,UAs,UCd,UCr
,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,6.426464,1.292941,51.67987,10.192695,0.7537104,42.60187
2,2,7.832384,1.798535,50.10409,11.815088,0.9789506,41.30757
3,3,7.516569,1.288461,48.74001,10.079057,0.1903262,36.47716
4,4,5.906656,2.075259,50.92745,8.719123,0.9364825,42.47987
5,5,7.181873,2.762643,55.16882,9.436559,1.4977829,47.78528
6,6,9.723429,3.054057,51.14812,11.589403,1.6645837,38.26386


These chemical data are organized according to subject ID (first column), followed by measures of:

+ **DWAs** (drinking water Arsenic levels in µg/L)
+ **DWCd** (drinking water Cadmium levels in µg/L)
+ **DWCr** (drinking water Chromium levels in µg/L)
+ **UAs** (urinary Arsenic levels in µg/L)
+ **UCd** (urinary Cadmium levels in µg/L)
+ **UCr** (urinary Chromium levels in µg/L)

## Training Module's Environmental Health Question
This training module was specifically developed to answer the following environmental health question: 
1. What is the average urinary Chromium concentration for each maternal education level?

We'll use base R and tidyverse to answer this question, but let's start with base R.

## Data Manipulation using Base R

#### Merging Data using Base R Syntax
Merging datasets means joining two or more datasets using a common identifier (generally some sort of ID). This is useful if you have multiple datasets describing different aspects of the study, different variables, or different measures across the same samples. Samples could correspond to the same study participants, animals, cell culture samples, environmental media samples, etc., depending on the study design. In the current example, we will be joining human demographic data and environmental metals exposure data collected from drinking water and human urine samples.

Let's start by merging the example demographic data with the chemical measurement data using the base R function `merge`. To learn more about this function, you can type the following:

In [9]:
?merge

This brings up helpful information in the R console. To merge these datasets using the `merge` function, use the following code:

In [6]:
full.data <- merge(demographic_data, chemical_data, by = "ID") 
dim(full.data) 

This merged dataframe contains 200 rows x 12 columns. Viewing this merged dataframe:

In [7]:
head(full.data)

,ID,BMI,MAge,MEdu,BW,GA,DWAs,DWCd,DWCr,UAs,UCd,UCr
,<int>,<dbl>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,27.7,22.99928,3,3180.058,34,6.426464,1.292941,51.67987,10.192695,0.7537104,42.60187
2,2,26.8,30.05142,3,3210.823,43,7.832384,1.798535,50.10409,11.815088,0.9789506,41.30757
3,3,33.2,28.0466,3,3311.551,40,7.516569,1.288461,48.74001,10.079057,0.1903262,36.47716
4,4,30.1,34.81796,3,3266.844,32,5.906656,2.075259,50.92745,8.719123,0.9364825,42.47987
5,5,37.4,42.6844,3,3664.088,35,7.181873,2.762643,55.16882,9.436559,1.4977829,47.78528
6,6,33.3,24.9496,3,3328.988,40,9.723429,3.054057,51.14812,11.589403,1.6645837,38.26386


We can see that the `merge` function retained the `ID` column (the first column in each original dataframe) but did not duplicate it, since it was used as the identifier to merge on. All other columns contain their original data, matched row by row on the IDs in the first column.

These datasets were actually quite easy to merge, since they had the same exact column identifier and number of rows. You can edit your script to include more specifics in instances when these may differ across datasets that you would like to merge. For example:

In [8]:
full.data <- merge(demographic_data, chemical_data, by.x = "ID", by.y = "ID") 
# The by.x and by.y arguments name the identifier column in the first and 
# second dataframe, respectively. Here, both are still "ID", but these 
# arguments let you merge when the two dataframes use different header text.

# Viewing data
head(full.data)

,ID,BMI,MAge,MEdu,BW,GA,DWAs,DWCd,DWCr,UAs,UCd,UCr
,<int>,<dbl>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,27.7,22.99928,3,3180.058,34,6.426464,1.292941,51.67987,10.192695,0.7537104,42.60187
2,2,26.8,30.05142,3,3210.823,43,7.832384,1.798535,50.10409,11.815088,0.9789506,41.30757
3,3,33.2,28.0466,3,3311.551,40,7.516569,1.288461,48.74001,10.079057,0.1903262,36.47716
4,4,30.1,34.81796,3,3266.844,32,5.906656,2.075259,50.92745,8.719123,0.9364825,42.47987
5,5,37.4,42.6844,3,3664.088,35,7.181873,2.762643,55.16882,9.436559,1.4977829,47.78528
6,6,33.3,24.9496,3,3328.988,40,9.723429,3.054057,51.14812,11.589403,1.6645837,38.26386
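
When the identifier columns are named differently, or when some IDs appear in only one dataset, `by.x`, `by.y`, and `all.x` come into play. The following is a minimal sketch on small made-up data frames; `demo_toy`, `chem_toy`, `ID.Demo`, and `ID.Chem` are hypothetical names and values, not part of the module's datasets.

```r
# Toy data frames (hypothetical) with different ID column names
demo_toy <- data.frame(ID.Demo = c(1, 2, 3), BMI = c(27.7, 26.8, 33.2))
chem_toy <- data.frame(ID.Chem = c(2, 3, 4), UCr = c(41.3, 36.5, 42.5))

# By default, merge keeps only the IDs present in both data frames
inner <- merge(demo_toy, chem_toy, by.x = "ID.Demo", by.y = "ID.Chem")
dim(inner)

# all.x = TRUE keeps every row of the first data frame and fills
# measures that have no match with NA
left <- merge(demo_toy, chem_toy, by.x = "ID.Demo", by.y = "ID.Chem",
              all.x = TRUE)
left
```

Comparing `nrow()` before and after a merge is a quick way to notice subjects that were silently dropped.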


#### Filtering and Subsetting Data using Base R Syntax

Filtering and subsetting data are useful tools when you need to focus your dataset to highlight data you are interested in analyzing downstream. These could represent, for example, specific samples or participants that meet certain criteria that you are interested in evaluating. It is also useful for simply removing particular variables or samples from dataframes as you are working through your script. These methods are illustrated here.

For this example, let's first define a vector of columns that we want to keep in our analysis:

In [9]:
subset.columns <- c("BMI", "MAge", "MEdu")

# Subsetting the data by selecting the columns represented in the defined 
#'subset.columns' vector
subset.data1 <- full.data[,subset.columns]

# Viewing the top of this subsetted dataframe
head(subset.data1) 

,BMI,MAge,MEdu
,<dbl>,<dbl>,<int>
1,27.7,22.99928,3
2,26.8,30.05142,3
3,33.2,28.0466,3
4,30.1,34.81796,3
5,37.4,42.6844,3
6,33.3,24.9496,3


We can also easily subset data based on row numbers. For example, to keep only the first 100 rows:

In [10]:
subset.data3 <- full.data[1:100,]

# Viewing the dimensions of this new dataframe
dim(subset.data3)

To remove the first 100 rows:

In [11]:
subset.data4 <- full.data[-c(1:100),]

# Viewing the dimensions of this new dataframe
dim(subset.data4)

**Conditional statements** are also used to filter and subset data. A **conditional statement** executes one block of code if a condition is true and, optionally, a different block of code if it is false.

A conditional statement requires a boolean expression, that is, one that evaluates to either TRUE or FALSE. Two of the more commonly used constructs for writing conditional statements are:
 - `if(){}`, an if statement, means "execute R code when the condition is met".
 - `if(){} else{}`, an if/else statement, means "execute one block of R code when the condition is met and, if not, execute the other block".

There are six comparison operators that are used to create these boolean values.
- `==` means "equals".
- `!=` means "not equal".
- `<` means "less than".
- `>` means "greater than".
- `<=` means "less than or equal to".
- `>=` means "greater than or equal to".

There are also three logical operators that are used to create these boolean values.
- `&` means "and".
- `|` means "or".
- `!` means "not".
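
To make these operators concrete, here is a small sketch on a made-up vector of BMI values (the values and the names `bmi` and `keep` are hypothetical). It shows that a comparison returns one TRUE/FALSE per element, that logical operators combine comparisons, and that `if`/`else` acts on a single TRUE/FALSE value.

```r
# Toy BMI values (hypothetical)
bmi <- c(21.5, 27.7, 33.2, 24.0)

# A comparison returns one TRUE/FALSE per element
bmi > 25

# Logical operators combine comparisons element by element
keep <- bmi < 22 | bmi > 27

# A logical vector inside brackets keeps only the TRUE positions
bmi[keep]

# if/else acts on a single TRUE/FALSE value
if (mean(bmi) > 25) {
  message("Mean BMI is above 25")
} else {
  message("Mean BMI is 25 or below")
}
```

The same logical vectors can be placed in the row position of a dataframe's brackets to keep only the rows that meet a condition.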

Filtering data based on conditions can also be done using the `subset` function:

In [12]:
# Filtering for subjects whose BMI is greater than 25 and who have a college education
subset.data6 <- subset(full.data, BMI > 25 & MEdu == 3)

Additionally, we can subset and select specific columns we would like to keep, using `select` within the `subset` function:

In [13]:
# Filtering for subjects whose BMI is less than 22 or greater than 27
# Also selecting the BMI, maternal age, and maternal education columns
subset.data7 <- subset(full.data, BMI < 22 | BMI > 27, select = subset.columns)

For more information on the `subset` function, see its associated [RDocumentation website](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/subset).

#### Melting and Casting Data using Base R Syntax
Melting and casting refer to the conversion of data to "long" or "wide" form, respectively, as discussed previously in [Module 1.2](link to module). You will often see data within the environmental health field in wide format, though long format is necessary for some procedures, such as plotting with [ggplot2](https://ggplot2.tidyverse.org) and conducting certain analyses.

Here, we'll illustrate some example script to melt and cast data using the [reshape2 package](https://www.rdocumentation.org/packages/reshape2/versions/1.4.4).
Let's first install and load the `reshape2` library:

In [14]:
if (!requireNamespace("reshape2"))
  install.packages("reshape2")

Loading required namespace: reshape2



In [15]:
library(reshape2)

Using the fully merged dataframe, let's remind ourselves what these data look like in the current dataframe format:

In [16]:
head(full.data)

,ID,BMI,MAge,MEdu,BW,GA,DWAs,DWCd,DWCr,UAs,UCd,UCr
,<int>,<dbl>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,27.7,22.99928,3,3180.058,34,6.426464,1.292941,51.67987,10.192695,0.7537104,42.60187
2,2,26.8,30.05142,3,3210.823,43,7.832384,1.798535,50.10409,11.815088,0.9789506,41.30757
3,3,33.2,28.0466,3,3311.551,40,7.516569,1.288461,48.74001,10.079057,0.1903262,36.47716
4,4,30.1,34.81796,3,3266.844,32,5.906656,2.075259,50.92745,8.719123,0.9364825,42.47987
5,5,37.4,42.6844,3,3664.088,35,7.181873,2.762643,55.16882,9.436559,1.4977829,47.78528
6,6,33.3,24.9496,3,3328.988,40,9.723429,3.054057,51.14812,11.589403,1.6645837,38.26386


These data are represented by single subject identifiers listed as unique IDs per row, with associated environmental measures and demographic data organized across the columns. Thus, this dataframe is currently in **wide (also known as casted)** format.

Let's convert this dataframe to **long (also known as melted)** format:

In [17]:
# Here, we are saying that we want a row for each unique 
# sample ID - variable measure pair
full.melted <- melt(full.data, id.vars = "ID") 

# Viewing this new dataframe
head(full.melted) 

,ID,variable,value
,<int>,<fct>,<dbl>
1,1,BMI,27.7
2,2,BMI,26.8
3,3,BMI,33.2
4,4,BMI,30.1
5,5,BMI,37.4
6,6,BMI,33.3


You can see here that each measure that was originally contained as a unique column has been reoriented, such that the original column header is now listed throughout the second column labeled `variable`. Then, the third column contains the value of this variable.

Let's see an example view of the middle of this new dataframe:

In [18]:
full.melted[1100:1110,1:3]

,ID,variable,value
,<int>,<fct>,<dbl>
1100,100,DWAs,7.928885
1101,101,DWAs,8.677403
1102,102,DWAs,8.115183
1103,103,DWAs,7.134189
1104,104,DWAs,8.816142
1105,105,DWAs,7.487227
1106,106,DWAs,7.541973
1107,107,DWAs,6.313516
1108,108,DWAs,6.654474
1109,109,DWAs,7.564429


Here, we can see a different variable (DWAs) now being listed. This continues throughout the entire dataframe, which has the following dimensions:

In [19]:
dim(full.melted)
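
The size of a melted dataframe can be predicted from the wide one: every non-ID column contributes one row per subject. For the merged data here (200 subjects, 11 non-ID columns), that arithmetic gives 200 x 11 = 2200 rows and 3 columns. The sketch below checks the same rule on a small made-up data frame; it uses the base R `stack` function as a stand-in for `melt`, and all names and values in it are hypothetical.

```r
# Toy wide data frame (hypothetical values): 3 subjects, 2 measures
wide_toy <- data.frame(ID = 1:3,
                       BMI = c(27.7, 26.8, 33.2),
                       UCr = c(42.6, 41.3, 36.5))

# Expected number of rows after melting: subjects x non-ID columns
expected_rows <- nrow(wide_toy) * (ncol(wide_toy) - 1)

# stack() piles the measure columns into 'values' and 'ind' columns;
# the IDs are repeated once per measure column
long_toy <- data.frame(ID = rep(wide_toy$ID, times = 2),
                       stack(wide_toy[, c("BMI", "UCr")]))

nrow(long_toy) == expected_rows
```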

Thus, this dataframe is clearly melted, in long format. Let's now re-cast this dataframe back into wide format using the `dcast` function.

In [20]:
# Here, we are telling the dcast function to give us one row for every 
# sample (ID) and one column for every entry in the column labeled 'variable'. 
# Then it automatically fills the dataframe with values from the 'value' column
full.cast <- dcast(full.melted, ID ~ variable) 
head(full.cast)

,ID,BMI,MAge,MEdu,BW,GA,DWAs,DWCd,DWCr,UAs,UCd,UCr
,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,27.7,22.99928,3,3180.058,34,6.426464,1.292941,51.67987,10.192695,0.7537104,42.60187
2,2,26.8,30.05142,3,3210.823,43,7.832384,1.798535,50.10409,11.815088,0.9789506,41.30757
3,3,33.2,28.0466,3,3311.551,40,7.516569,1.288461,48.74001,10.079057,0.1903262,36.47716
4,4,30.1,34.81796,3,3266.844,32,5.906656,2.075259,50.92745,8.719123,0.9364825,42.47987
5,5,37.4,42.6844,3,3664.088,35,7.181873,2.762643,55.16882,9.436559,1.4977829,47.78528
6,6,33.3,24.9496,3,3328.988,40,9.723429,3.054057,51.14812,11.589403,1.6645837,38.26386


Here, we can see that this dataframe is back in its original casted (or wide) format. Now that we're familiar with some base R functions to reshape our data, let's answer our original question.

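In [None]:
# Using the `aggregate` function to find the mean of urinary Chromium levels
# (UCr) within each maternal education (MEdu) class
aggregate(UCr ~ MEdu, data = full.data, FUN = mean)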

## Introduction to Tidyverse

[Tidyverse](https://www.tidyverse.org) is a collection of packages that are commonly used to more efficiently organize and manipulate datasets in R. This collection of packages has its own syntax, dataset formats, and formatting protocols that differ slightly from the base R functions. Here, we will carry out all of the same data organization exercises described above using Tidyverse.


#### Downloading and Loading the Tidyverse Package

If you don't have `tidyverse` already installed, you will need to install it using:

In [21]:
if(!require(tidyverse)) 
    install.packages("tidyverse")

Loading required package: tidyverse

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


And then load the tidyverse package using:

In [22]:
library(tidyverse)

#### Merging Data using Tidyverse Syntax

To merge the same example dataframes using tidyverse, you can run the following script:

In [23]:
full.data.tidy <- inner_join(demographic_data, chemical_data, by = "ID")

# Note, for future scripting purposes, we can still merge with different IDs 
# using: by = c("ID.Demo"="ID.Chem")
head(full.data.tidy)

,ID,BMI,MAge,MEdu,BW,GA,DWAs,DWCd,DWCr,UAs,UCd,UCr
,<int>,<dbl>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,27.7,22.99928,3,3180.058,34,6.426464,1.292941,51.67987,10.192695,0.7537104,42.60187
2,2,26.8,30.05142,3,3210.823,43,7.832384,1.798535,50.10409,11.815088,0.9789506,41.30757
3,3,33.2,28.0466,3,3311.551,40,7.516569,1.288461,48.74001,10.079057,0.1903262,36.47716
4,4,30.1,34.81796,3,3266.844,32,5.906656,2.075259,50.92745,8.719123,0.9364825,42.47987
5,5,37.4,42.6844,3,3664.088,35,7.181873,2.762643,55.16882,9.436559,1.4977829,47.78528
6,6,33.3,24.9496,3,3328.988,40,9.723429,3.054057,51.14812,11.589403,1.6645837,38.26386
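
`inner_join` is one of several dplyr join functions, and the choice matters once the two dataframes do not contain exactly the same IDs. The sketch below compares them on small made-up data frames (`demo_toy`, `chem_toy`, `ID.Demo`, and `ID.Chem` are hypothetical) and assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data frames (hypothetical) with different ID column names
demo_toy <- data.frame(ID.Demo = c(1, 2, 3), BMI = c(27.7, 26.8, 33.2))
chem_toy <- data.frame(ID.Chem = c(2, 3, 4), UCr = c(41.3, 36.5, 42.5))
id_map <- c("ID.Demo" = "ID.Chem")

# inner_join keeps IDs found in both; left_join keeps all IDs from the
# first data frame; full_join keeps all IDs from either one
inner <- inner_join(demo_toy, chem_toy, by = id_map)
left  <- left_join(demo_toy, chem_toy, by = id_map)
full  <- full_join(demo_toy, chem_toy, by = id_map)
c(inner = nrow(inner), left = nrow(left), full = nrow(full))

# anti_join lists rows of the first data frame that found no match
unmatched <- anti_join(demo_toy, chem_toy, by = id_map)
unmatched
```

Running `anti_join` before a merge is a convenient check for subjects that are missing from one of the datasets.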


#### Filtering and Subsetting Data using Tidyverse Syntax

To subset columns in tidyverse, run the following:

In [24]:
subset.tidy1 <- full.data.tidy %>% 
    select(all_of(subset.columns))
head(subset.tidy1)

,BMI,MAge,MEdu
,<dbl>,<dbl>,<int>
1,27.7,22.99928,3
2,26.8,30.05142,3
3,33.2,28.0466,3
4,30.1,34.81796,3
5,37.4,42.6844,3
6,33.3,24.9496,3


Note that `any_of` lets the subsetting vector include column names that are not present in the dataframe; these are simply dropped:

In [26]:
# Note that we're including a 'fake' column here 'NotAColName' to illustrate 
# how to incorporate additional columns; though this column gets dropped in 
# the next line of code

subset.columns2 <- c(subset.columns, "NotAColName")
# Viewing this new vector
subset.columns2

subset.tidy2 <- full.data.tidy %>% select(any_of(subset.columns2))
# Viewing the top of this new dataframe
head(subset.tidy2) 

,BMI,MAge,MEdu
,<dbl>,<dbl>,<int>
1,27.7,22.99928,3
2,26.8,30.05142,3
3,33.2,28.0466,3
4,30.1,34.81796,3
5,37.4,42.6844,3
6,33.3,24.9496,3


Note that the 'fake' column `NotAColName` gets automatically dropped here.
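
The contrast with `all_of` is worth seeing once: `all_of` is strict and stops with an error when a requested column is missing, whereas `any_of` skips it. A minimal sketch on a made-up data frame (the names `toy`, `wanted`, and `result` are hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Toy data frame (hypothetical values)
toy <- data.frame(ID = 1:3, BMI = c(27.7, 26.8, 33.2))
wanted <- c("BMI", "NotAColName")

# any_of() silently skips names that are absent from the data frame
names(select(toy, any_of(wanted)))

# all_of() is strict: a missing name raises an error, which is caught
# here so that the script keeps running
result <- tryCatch(
  select(toy, all_of(wanted)),
  error = function(e) "all_of() stopped: column not found"
)
result
```

Using `all_of` is the safer default when a missing column would indicate a mistake, and `any_of` is handy when the same vector of names is reused across dataframes with slightly different columns.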


To remove columns using tidyverse, you can run the following:

In [28]:
# Removing columns
subset.tidy3 <- full.data.tidy %>% 
    select(-all_of(subset.columns))

# Viewing this new dataframe
head(subset.tidy3) 

,ID,BW,GA,DWAs,DWCd,DWCr,UAs,UCd,UCr
,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,3180.058,34,6.426464,1.292941,51.67987,10.192695,0.7537104,42.60187
2,2,3210.823,43,7.832384,1.798535,50.10409,11.815088,0.9789506,41.30757
3,3,3311.551,40,7.516569,1.288461,48.74001,10.079057,0.1903262,36.47716
4,4,3266.844,32,5.906656,2.075259,50.92745,8.719123,0.9364825,42.47987
5,5,3664.088,35,7.181873,2.762643,55.16882,9.436559,1.4977829,47.78528
6,6,3328.988,40,9.723429,3.054057,51.14812,11.589403,1.6645837,38.26386


Subsetting rows using tidyverse:

In [29]:
# Selecting to retain only the first 100 rows
subset.tidy4 <- full.data.tidy %>% 
    slice(1:100) 

dim(subset.tidy4)

In [30]:
# Selecting to remove the first 100 rows
subset.tidy5 <- full.data.tidy %>% 
    slice(-c(1:100))

dim(subset.tidy5)

Filtering data based on conditional statements using tidyverse:

In [31]:
subset.tidy6 <- full.data.tidy %>% 
    filter(BMI > 25 & MAge > 31)

dim(subset.tidy6)

The same filter can be followed by `select` in one pipeline to keep only certain columns:

In [32]:
subset.tidy7 <- full.data.tidy %>% 
    filter(BMI > 25 & MAge > 31) %>% 
    select(BMI, MAge, MEdu)
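
A few variations on `filter` are sketched below using a made-up data frame (all names and values are hypothetical), assuming dplyr is installed. Conditions separated by commas are combined with "and", and `%in%` tests membership in a set of values.

```r
library(dplyr)

# Toy data frame (hypothetical values)
toy <- data.frame(ID = 1:5,
                  BMI = c(21.5, 27.7, 33.2, 24.0, 29.1),
                  MEdu = c(1L, 3L, 2L, 3L, 1L))

# Conditions separated by commas behave like conditions joined with &
a <- toy %>% filter(BMI > 25, MEdu == 3)
b <- toy %>% filter(BMI > 25 & MEdu == 3)
identical(a$ID, b$ID)

# %in% keeps rows whose value matches any entry in a set
low_mid_edu <- toy %>% filter(MEdu %in% c(1, 2))
low_mid_edu$ID
```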

#### Melting and Casting Data using Tidyverse Syntax
To melt and cast data in tidyverse, you can use the `pivot` functions (i.e., `pivot_longer` or `pivot_wider`). These are exemplified below.

Melting to long format using tidyverse:

In [33]:
full.pivotlong <- full.data.tidy %>% 
    pivot_longer(-ID, names_to = "var", values_to = "value")

head(full.pivotlong, 15)

ID,var,value
<int>,<chr>,<dbl>
1,BMI,27.7
1,MAge,22.999283
1,MEdu,3.0
1,BW,3180.058132
1,GA,34.0
1,DWAs,6.4264644
1,DWCd,1.2929409
1,DWCr,51.6798741
1,UAs,10.1926949
1,UCd,0.7537104


Casting to wide format using tidyverse:

In [34]:
full.pivotwide <- full.pivotlong %>% 
    pivot_wider(names_from = "var", values_from = "value")
head(full.pivotwide)

ID,BMI,MAge,MEdu,BW,GA,DWAs,DWCd,DWCr,UAs,UCd,UCr
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,27.7,22.99928,3,3180.058,34,6.426464,1.292941,51.67987,10.192695,0.7537104,42.60187
2,26.8,30.05142,3,3210.823,43,7.832384,1.798535,50.10409,11.815088,0.9789506,41.30757
3,33.2,28.0466,3,3311.551,40,7.516569,1.288461,48.74001,10.079057,0.1903262,36.47716
4,30.1,34.81796,3,3266.844,32,5.906656,2.075259,50.92745,8.719123,0.9364825,42.47987
5,37.4,42.6844,3,3664.088,35,7.181873,2.762643,55.16882,9.436559,1.4977829,47.78528
6,33.3,24.9496,3,3328.988,40,9.723429,3.054057,51.14812,11.589403,1.6645837,38.26386
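
One detail visible in the output above is that columns that started as integers (such as `MEdu`) come back as doubles after the round trip, because all values shared a single `value` column in long format. The sketch below reproduces this on a made-up tibble and restores the integer type; the names `wide_toy`, `round_trip`, and `restored` are hypothetical, and the dplyr and tidyr packages are assumed to be installed.

```r
library(dplyr)
library(tidyr)

# Toy wide data (hypothetical values) with one integer and one double column
wide_toy <- tibble(ID = 1:3,
                   MEdu = c(3L, 2L, 3L),
                   UCr = c(42.6, 41.3, 36.5))

# Pivot to long format and back to wide format
round_trip <- wide_toy %>%
  pivot_longer(-ID, names_to = "var", values_to = "value") %>%
  pivot_wider(names_from = "var", values_from = "value")

# MEdu went in as an integer but comes back as a double
class(wide_toy$MEdu)
class(round_trip$MEdu)

# Restore the integer type where it matters
restored <- round_trip %>% mutate(MEdu = as.integer(MEdu))
identical(restored$MEdu, wide_toy$MEdu)
```

For most summaries the difference is harmless, but it can matter when a column is later compared against integer codes or used as a grouping key.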


Now that we're familiar with some tidyverse functions to reshape our data, let's answer our original question.

In [36]:
full.data.tidy %>%
# using the `group_by` function to group our dataset within each `MEdu` class
    group_by(MEdu) %>%
# using the `summarize` function to find the mean of urinary Chromium levels
    summarize(Average_Urinary_Chromium_Concentration = mean(UCr))

MEdu,Average_Urinary_Chromium_Concentration
<int>,<dbl>
1,39.88055
2,40.61807
3,40.41556


## Concluding Remarks
Together, this training module provides introductory-level information on the basics of data organization in R. The important data organization and manipulation methods of merging, filtering, subsetting, melting, and casting are demonstrated on an environmentally relevant dataset. Check out Posit's [dplyr Cheat Sheet](https://posit.co/resources/cheatsheets/?type=posit-cheatsheets&_page=2/) for more documentation on manipulating tables in R. (`dplyr` is a package included in the `tidyverse` collection that is designed specifically for data manipulation.) You might need it for the next section!


## Test your knowledge
1. Which subjects, arranged from highest to lowest drinking water Cadmium levels, had babies at a gestational age of at least 35 weeks and had urinary Cadmium levels of at least 1.5 µg/L?

**Hint**: Try using the `arrange` function from tidyverse.