# Getting Started with Data Wrangling

Data wrangling is the process of cleaning, transforming, and preparing raw data for analysis. In this notebook, we demonstrate essential data wrangling techniques in R using popular packages like `dplyr` and `tidyr`. Tasks include subsetting data, renaming columns, handling missing values, performing group-based summaries, and reshaping datasets. These steps are critical for ensuring data quality and extracting meaningful insights from real-world datasets.


## Install Packages and Libraries

This code installs and loads the **tidyverse**, a collection of R packages designed for data science. The `tidyverse` includes popular tools like `dplyr` for data manipulation, `ggplot2` for visualization, and `readr` for reading data, among others.

* `install.packages("tidyverse")` installs the `tidyverse` package from CRAN. This only needs to be done once per R installation, so the line can be skipped or commented out on later runs.
* `library(tidyverse)` loads the package into the current R session, making its functions available for use.


In [1]:
install.packages("tidyverse")

library(tidyverse)


The downloaded binary packages are in
	/var/folders/2h/84wxzls579b1yv00g4jj02fh0000gn/T//RtmpODztsP/downloaded_packages


-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.5
v forcats   1.0.0     v stringr   1.5.1
v ggplot2   3.5.2     v tibble    3.3.0
v lubridate 1.9.4     v tidyr     1.3.1
v purrr     1.0.4     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Download Dataset from the URL

This line reads in a dataset from an online source and stores it as a data frame called `NHANES`.

* `read_csv()` is a `tidyverse` function from the `readr` package that reads comma-separated values (CSV) files into R as a **tibble**, a modern version of a data frame.
* The dataset is being pulled directly from a GitHub repository via its raw URL:
  `'https://raw.githubusercontent.com/GTPB/PSLS20/master/data/NHANES.csv'`
* The dataset `NHANES` likely contains health and demographic data from the National Health and Nutrition Examination Survey.


In [2]:
NHANES <- read_csv('https://raw.githubusercontent.com/GTPB/PSLS20/master/data/NHANES.csv')

Rows: 10000 Columns: 76
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (31): SurveyYr, Gender, AgeDecade, Race1, Race3, Education, MaritalStatu...
dbl (45): ID, Age, AgeMonths, HHIncomeMid, Poverty, HomeRooms, Weight, Lengt...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Data Exploration

This code examines the structure and contents of the `NHANES` dataset:

* `spec(NHANES)` displays the **guessed column specifications**, showing how `readr` interpreted each column (e.g., character or double). This helps verify that the data type of each column was correctly detected during import.
* `head(NHANES)` displays the **first six rows** of the dataset, giving a quick preview of the data to check its structure and contents.


In [3]:
# Using "spec" to find the guessed column specifications and column names
spec(NHANES)
head(NHANES)

cols(
  ID = col_double(),
  SurveyYr = col_character(),
  Gender = col_character(),
  Age = col_double(),
  AgeDecade = col_character(),
  AgeMonths = col_double(),
  Race1 = col_character(),
  Race3 = col_character(),
  Education = col_character(),
  MaritalStatus = col_character(),
  HHIncome = col_character(),
  HHIncomeMid = col_double(),
  Poverty = col_double(),
  HomeRooms = col_double(),
  HomeOwn = col_character(),
  Work = col_character(),
  Weight = col_double(),
  Length = col_double(),
  HeadCirc = col_double(),
  Height = col_double(),
  BMI = col_double(),
  BMICatUnder20yrs = col_character(),
  BMI_WHO = col_character(),
  Pulse = col_double(),
  BPSysAve = col_double(),
  BPDiaAve = col_double(),
  BPSys1 =

ID,SurveyYr,Gender,Age,AgeDecade,AgeMonths,Race1,Race3,Education,MaritalStatus,...,RegularMarij,AgeRegMarij,HardDrugs,SexEver,SexAge,SexNumPartnLife,SexNumPartYear,SameSex,SexOrientation,PregnantNow
<dbl>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,...,<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
51624,2009_10,male,34,30-39,409,White,,High School,Married,...,No,,Yes,Yes,16.0,8.0,1.0,No,Heterosexual,
51624,2009_10,male,34,30-39,409,White,,High School,Married,...,No,,Yes,Yes,16.0,8.0,1.0,No,Heterosexual,
51624,2009_10,male,34,30-39,409,White,,High School,Married,...,No,,Yes,Yes,16.0,8.0,1.0,No,Heterosexual,
51625,2009_10,male,4,0-9,49,Other,,,,...,,,,,,,,,,
51630,2009_10,female,49,40-49,596,White,,Some College,LivePartner,...,No,,Yes,Yes,12.0,10.0,1.0,Yes,Heterosexual,
51638,2009_10,male,9,0-9,115,White,,,,...,,,,,,,,,,


This code retrieves the column and row names of the `NHANES` tibble:

* `colnames(NHANES)` returns a vector containing the names of all the columns in the dataset.
* `rownames(NHANES)` returns the row names. Since `NHANES` is a **tibble** (a modern data frame), no custom row names are set, so this returns the default row labels `"1"`, `"2"`, ..., `"10000"` as a character vector.

Tibbles generally avoid using row names, favoring row numbers instead.
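
A quick check on a toy tibble (hypothetical values, not the NHANES file) shows what these two functions hand back. This is a minimal sketch that assumes only the `tibble` package, which the tidyverse loads.

```r
library(tibble)

# Toy tibble standing in for NHANES (illustrative values only)
toy <- tibble(ID = c(51624, 51625, 51630),
              Gender = c("male", "male", "female"))

col_names <- colnames(toy)      # character vector of column names
row_names <- rownames(toy)      # default row labels, returned as strings
has_rn    <- has_rownames(toy)  # FALSE: no custom row names are stored

print(col_names)
print(row_names)
print(has_rn)
```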


In [4]:
# find column and row names of the tibble
colnames(NHANES)
rownames(NHANES)


This code extracts a subset of columns from the `NHANES` dataset and renames one of them:

* The `%>%` pipe operator passes the `NHANES` dataset into the `select()` function from `dplyr`.
* `select(ID, Age, BMI, BloodPressure = BPSysAve, SurveyYr, Gender)`:

  * Extracts the columns: `ID`, `Age`, `BMI`, `BPSysAve`, `SurveyYr`, and `Gender`.
  * Renames the column `BPSysAve` to `BloodPressure` while subsetting.

The resulting simplified dataset is stored in a new tibble called `nhanes`.

* `head(nhanes)` displays the first six rows of the new subset to confirm the selection and renaming.


In [5]:
# Extract only the columns ID, Age, BMI, BPSysAve, SurveyYr, and Gender
# Rename BPSysAve to BloodPressure while subsetting the data
nhanes <- NHANES %>% select(ID, Age, BMI, BloodPressure = BPSysAve, SurveyYr, Gender)
head(nhanes)


ID,Age,BMI,BloodPressure,SurveyYr,Gender
<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51625,4,15.3,,2009_10,male
51630,49,30.57,112.0,2009_10,female
51638,9,16.82,86.0,2009_10,male


This code renames multiple columns in the `nhanes` dataset using the `rename()` function from `dplyr`:

* The `%>%` pipe passes the `nhanes` tibble into the `rename()` function.
* Inside `rename()`:

  * `BP = BloodPressure` renames the `BloodPressure` column to `BP`.
  * `Age_yr = Age` renames the `Age` column to `Age_yr`.

The updated dataset with the new column names is saved back into `nhanes`.

* `head(nhanes)` displays the first six rows to confirm the column name changes.
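
To contrast the two renaming routes used so far, here is a small sketch on a made-up tibble (the values and the object names `toy`, `picked`, and `renamed` are illustrative assumptions). `select()` drops every column that is not listed, while `rename()` keeps all columns in place.

```r
library(dplyr)

# Toy data with the same column names as the notebook (values are invented)
toy <- tibble(ID = 1:3, Age = c(34, 4, 49), BPSysAve = c(113, NA, 112))

# select() keeps only the listed columns and can rename while selecting
picked <- toy %>% select(ID, BloodPressure = BPSysAve)

# rename() changes names but keeps every column in its original position
renamed <- toy %>% rename(BP = BPSysAve, Age_yr = Age)

print(names(picked))
print(names(renamed))
```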


In [6]:
#rename multiple columns of the dataset after subsetting data -- using 'rename' function from dplyr
nhanes <- nhanes %>% 
  rename(
     BP = BloodPressure,
     Age_yr = Age
  )
head(nhanes)

ID,Age_yr,BMI,BP,SurveyYr,Gender
<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51625,4,15.3,,2009_10,male
51630,49,30.57,112.0,2009_10,female
51638,9,16.82,86.0,2009_10,male


This code demonstrates different ways to access columns in the `nhanes` tibble:

* `nhanes$ID` — Accesses the `ID` column directly as a vector. This is one of the most common and convenient ways to extract a single column.
* `nhanes['ID']` — Returns the `ID` column as a tibble with one column. By default, only the first 10 rows are displayed. To see more, you can use `print(nhanes['ID'], n = ...)`.
* `nhanes[['ID']]` — Extracts the `ID` column as a vector, similar to `nhanes$ID`, but uses double brackets. This method is especially useful when accessing columns programmatically using variables.
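
The point about programmatic access can be sketched as follows. The tibble and the variable `col` are hypothetical; the idea is that `[[ ]]` accepts a column name stored in a variable, whereas `$` looks for a column literally named after the symbol that follows it.

```r
library(tibble)

toy <- tibble(ID = c(51624, 51625), Gender = c("male", "male"))

col <- "ID"            # column name held in a variable (illustrative)

by_dollar  <- toy$ID     # vector
by_double  <- toy[[col]] # same vector, chosen via the variable
by_single  <- toy[col]   # one-column tibble, not a vector

print(by_double)
print(class(by_single))
```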


In [7]:
# Ways to access columns
nhanes$ID   # same as nhanes[['ID']]
print(nhanes['ID'])  # prints first 10 values, use print(colname, n=...) to see more rows
nhanes[['ID']]

# A tibble: 10,000 x 1
      ID
   <dbl>
 1 51624
 2 51624
 3 51624
 4 51625
 5 51630
 6 51638
 7 51646
 8 51647
 9 51647
10 51647
# i 9,990 more rows


This code shows different ways to access specific elements, rows, or columns of the `nhanes` dataset using standard indexing:

* `nhanes[2, "Gender"]`: retrieves **row 2, column "Gender"**. On a tibble this returns a 1 x 1 tibble rather than a bare value; use `nhanes[[2, "Gender"]]` or `nhanes$Gender[2]` to get the value itself.
* `nhanes[ , c("SurveyYr", "Gender")]`: selects all rows but only the `"SurveyYr"` and `"Gender"` columns.
* `nhanes[1:6, ]`: selects **rows 1 through 6** and all columns.
* `nhanes[ , ]`: returns the entire dataset. Leaving both positions empty means "all rows" and "all columns."

This indexing syntax is the same as for base R data frames, with one difference: single-bracket indexing on a tibble always returns a tibble, whereas a base data frame drops a single column or cell to a vector.
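
The sketch below, on invented data, compares single-bracket indexing on a tibble with the same call on a base data frame, and shows two ways to pull out a bare value. The object names are assumptions made for the example.

```r
library(tibble)

toy_tbl <- tibble(SurveyYr = c("2009_10", "2009_10"),
                  Gender   = c("male", "female"))
toy_df  <- as.data.frame(toy_tbl)

cell_tbl <- toy_tbl[2, "Gender"]    # 1 x 1 tibble
cell_df  <- toy_df[2, "Gender"]     # base data frame drops to a plain value
cell_val <- toy_tbl[[2, "Gender"]]  # bare value from the tibble
cell_alt <- toy_tbl$Gender[2]       # equivalent, via the column vector

print(cell_tbl)
print(cell_val)
```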


In [8]:
# Accessing elements of dataset 
nhanes[2, "Gender"]  # row 2, column Gender (returned as a 1 x 1 tibble)
nhanes[ , c("SurveyYr", "Gender")] # all rows, two columns
nhanes[1:6, ]             # rows 1-6, all columns
nhanes[ , ]               # everything

Gender
<chr>
male


SurveyYr,Gender
<chr>,<chr>
2009_10,male
2009_10,male
2009_10,male
2009_10,male
2009_10,female
2009_10,male
2009_10,male
2009_10,female
2009_10,female
2009_10,female


ID,Age_yr,BMI,BP,SurveyYr,Gender
<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51625,4,15.3,,2009_10,male
51630,49,30.57,112.0,2009_10,female
51638,9,16.82,86.0,2009_10,male


ID,Age_yr,BMI,BP,SurveyYr,Gender
<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113,2009_10,male
51624,34,32.22,113,2009_10,male
51624,34,32.22,113,2009_10,male
51625,4,15.30,,2009_10,male
51630,49,30.57,112,2009_10,female
51638,9,16.82,86,2009_10,male
51646,8,20.64,107,2009_10,male
51647,45,27.24,118,2009_10,female
51647,45,27.24,118,2009_10,female
51647,45,27.24,118,2009_10,female


This code converts the data type of the `ID` column in the `nhanes` dataset:

* `as.character(nhanes$ID)` converts the `ID` column from its current type (`double`, meaning numeric) to a character (text) type.
* The result is assigned back to `nhanes$ID`, replacing the original column with its character version.

This is a common step because IDs are labels rather than quantities: storing them as text prevents them from being summed, averaged, or otherwise treated as numbers by mistake.


In [9]:
# Change the column type of "ID" from "double" to "character"
nhanes$ID <- as.character(nhanes$ID)

This code inspects the `nhanes` dataset after modifying the `ID` column:

* `head(nhanes)` displays the first six rows to quickly review the current state of the data.
* `dim(nhanes)` returns the **dimensions** of the dataset as a numeric vector, showing the total number of rows and columns (`rows × columns`).
* `str(nhanes)` provides a **compact structure summary** of the dataset, detailing each column’s name, data type, and a preview of its values.

These commands help confirm that the data looks correct and to understand its overall structure.


In [10]:
# Display the first few rows of the subset after converting column 'ID' to character
head(nhanes)
dim(nhanes) #dimension of the data (Total_rows x Total_columns)
str(nhanes) # Details of the data by columns

ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51625,4,15.3,,2009_10,male
51630,49,30.57,112.0,2009_10,female
51638,9,16.82,86.0,2009_10,male


tibble [10,000 x 6] (S3: tbl_df/tbl/data.frame)
 $ ID      : chr [1:10000] "51624" "51624" "51624" "51625" ...
 $ Age_yr  : num [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
 $ BMI     : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
 $ BP      : num [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
 $ SurveyYr: chr [1:10000] "2009_10" "2009_10" "2009_10" "2009_10" ...
 $ Gender  : chr [1:10000] "male" "male" "male" "male" ...


This code checks the type of the `nhanes` object in R:

* `class(nhanes)` returns the **class** of the object, which indicates what kind of R object it is (likely `"tbl_df"`, `"tbl"`, `"data.frame"` for a tibble).
* `typeof(nhanes)` returns the **internal storage mode** of the object, showing the low-level data type used to store it (usually `"list"` for tibbles and data frames).

Together, these functions help you understand both the high-level object type and the underlying storage structure.
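
As a small illustration on toy objects (not the notebook's data), the difference between the two functions shows up when a tibble and a base data frame are compared: their classes differ, but both are stored as lists of columns.

```r
library(tibble)

toy_tbl <- tibble(x = 1:3)
toy_df  <- data.frame(x = 1:3)

class_tbl <- class(toy_tbl)   # "tbl_df" "tbl" "data.frame"
class_df  <- class(toy_df)    # "data.frame"
type_tbl  <- typeof(toy_tbl)  # "list"
type_df   <- typeof(toy_df)   # "list"

print(class_tbl)
print(type_tbl)
```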


In [11]:
#find the class of the data object "nhanes"
class(nhanes)
typeof(nhanes)

This code creates a frequency table of the `Gender` column in the `nhanes` dataset:

* `table(nhanes$Gender)` counts the number of occurrences of each unique value in the `Gender` column.
* It returns a summary showing how many rows correspond to each gender category, helping to understand the distribution of genders in the data.
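
Two optional extras that the notebook does not use can be sketched on an invented vector: `useNA = "ifany"` adds a count for missing values, and `prop.table()` turns counts into proportions.

```r
# Invented gender values, including one missing entry
gender <- c("male", "female", "female", NA, "male", "female")

counts      <- table(gender)                  # NA is ignored by default
counts_na   <- table(gender, useNA = "ifany") # adds an <NA> category
proportions <- prop.table(counts)             # shares among non-missing values

print(counts_na)
print(proportions)
```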


In [12]:
# Using 'table' function for frequency table
table(nhanes$Gender)



female   male 
  5020   4980 

This code generates summary statistics for the `nhanes` dataset and saves the output to a text file, while also printing the summary in the notebook:

* `capture.output({ ... })` captures all printed output inside the braces into a character vector `summary_text`.
* Inside the capture block:

  * `cat('Summary statistics of the data before cleaning:....\n')` prints a header message.
  * `summary(as.data.frame(nhanes))` generates summary statistics for the dataset converted to a base data frame (which helps avoid tibble printing quirks).
* `writeLines(summary_text, 'sumary_stats_before.txt')` writes the captured summary text to a file named `sumary_stats_before.txt`.
* Finally, `summary(nhanes)` prints the summary directly to the console or notebook output.

This approach ensures you have the summary saved for later review while also displaying it immediately.
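
The capture-then-write pattern can be tried in isolation with the sketch below. It uses an invented numeric vector and writes to a temporary file (via `tempfile()`, an assumption made here so that nothing is left behind in the working directory).

```r
bp <- c(113, NA, 112, 86)  # invented values

# Everything printed inside the braces is collected as lines of text
summary_lines <- capture.output({
  cat("Summary statistics (toy data)\n")
  summary(bp)
})

out_file <- tempfile(fileext = ".txt")
writeLines(summary_lines, out_file)

# Read the file back to confirm what was saved
saved <- readLines(out_file)
print(saved)
```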


In [13]:
# Find the summary statistics of the dataset and save the output in another file
summary_text <- capture.output({
  cat('Summary statistics of the data before cleaning:....\n')
  summary(as.data.frame(nhanes))
})

writeLines(summary_text, 'sumary_stats_before.txt')

summary(nhanes)

      ID                Age_yr           BMI              BP       
 Length:10000       Min.   : 0.00   Min.   :12.88   Min.   : 76.0  
 Class :character   1st Qu.:17.00   1st Qu.:21.58   1st Qu.:106.0  
 Mode  :character   Median :36.00   Median :25.98   Median :116.0  
                    Mean   :36.74   Mean   :26.66   Mean   :118.2  
                    3rd Qu.:54.00   3rd Qu.:30.89   3rd Qu.:127.0  
                    Max.   :80.00   Max.   :81.25   Max.   :226.0  
                                    NA's   :366     NA's   :1449   
   SurveyYr            Gender         
 Length:10000       Length:10000      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      

This code demonstrates how to split and then recombine a column in the `nhanes` dataset using `tidyr` functions:

* `separate(nhanes, SurveyYr, into = c("Year1", "Year2"), sep = '_', remove = TRUE)`:

  * Splits the `SurveyYr` column into two new columns named `Year1` and `Year2` by splitting at the underscore (`'_'`).
  * The original `SurveyYr` column is removed (`remove = TRUE`).
  * The resulting dataset is stored as `nhanes1`.

* `head(nhanes1)` shows the first six rows of this modified dataset with the split columns.

* `unite(nhanes1, "SurveyYr", c("Year1", "Year2"), sep = "_")`:

  * Combines the two columns `Year1` and `Year2` back into a single `SurveyYr` column, joining them with an underscore.
  * The result is stored back into `nhanes`.

* `head(nhanes)` shows the first six rows of the dataset after recombining.
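
The round trip can be reproduced on a two-row toy data frame. The sketch adds `convert = TRUE`, which the notebook does not use, to show that the split pieces can be turned into numbers when that is useful; all object names are illustrative.

```r
library(tidyr)

toy <- data.frame(ID = c("a", "b"),
                  SurveyYr = c("2009_10", "2011_12"))

# Split at the underscore; convert = TRUE turns "2009" into the integer 2009
parts <- separate(toy, SurveyYr, into = c("Year1", "Year2"),
                  sep = "_", convert = TRUE)

# Glue the two pieces back together with the same separator
rejoined <- unite(parts, "SurveyYr", Year1, Year2, sep = "_")

print(parts)
print(rejoined)
```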


In [14]:
# Using function 'unite' and 'separate' for column 'SurveyYr'
nhanes1 <- separate(nhanes, SurveyYr, into = c("Year1", "Year2"),
                    sep = '_', remove=TRUE) 
head(nhanes1)

nhanes <- unite(nhanes1, "SurveyYr", c("Year1", "Year2"), sep="_")
head(nhanes)

ID,Age_yr,BMI,BP,Year1,Year2,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
51624,34,32.22,113.0,2009,10,male
51624,34,32.22,113.0,2009,10,male
51624,34,32.22,113.0,2009,10,male
51625,4,15.3,,2009,10,male
51630,49,30.57,112.0,2009,10,female
51638,9,16.82,86.0,2009,10,male


ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51624,34,32.22,113.0,2009_10,male
51625,4,15.3,,2009_10,male
51630,49,30.57,112.0,2009_10,female
51638,9,16.82,86.0,2009_10,male


### Cleaning and Subsetting the Data

This code checks for duplicate rows in the `nhanes` dataset:

* `print('Duplicated rows...')` prints a message to indicate the next operation.
* `duplicated(nhanes)` returns a logical vector indicating which rows in the dataset are duplicates of previous rows:

  * `TRUE` means the row is a duplicate.
  * `FALSE` means it’s unique or the first occurrence.

This helps identify if there are any repeated rows in the data.


In [15]:
# Identify duplicate rows
print('Duplicated rows...')
print(duplicated(nhanes))

[1] "Duplicated rows..."
    [1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
   [25]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
   [37]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
   [49] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
   [61] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE
   [73]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE
   [85]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
   [97] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
  [109]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
  [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
  [133]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
  [145] FALSE  

This code calculates and prints the number of duplicate rows in the `nhanes` dataset:

* `print('count of duplicate rows')` outputs a message indicating what is being calculated.
* `sum(duplicated(nhanes))` counts how many rows are duplicates by summing the logical vector returned by `duplicated()`, where `TRUE` counts as 1 and `FALSE` as 0.

This gives the total number of duplicate rows present in the dataset.
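
On a four-row toy data frame the behaviour is easy to verify by eye. The sketch also shows one way, assumed here for illustration, to list every copy of a duplicated row (including the first occurrence) by combining `duplicated()` with `fromLast = TRUE`.

```r
toy <- data.frame(ID  = c("51624", "51624", "51624", "51625"),
                  Age = c(34, 34, 34, 4))

dup_flags <- duplicated(toy)   # FALSE TRUE TRUE FALSE
n_dup     <- sum(dup_flags)    # number of repeated rows

# All rows that occur more than once, first occurrence included
all_copies <- toy[duplicated(toy) | duplicated(toy, fromLast = TRUE), ]

print(dup_flags)
print(n_dup)
print(all_copies)
```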


In [16]:
# count of duplicated data
print('count of duplicate rows')
sum(duplicated(nhanes))

[1] "count of duplicate rows"


This code creates datasets with unique rows and examines their contents:

* `data_uniq <- unique(nhanes)` creates a new dataset by removing duplicate rows from `nhanes` using the base R `unique()` function.
* `data_dist <- distinct(nhanes)` creates a new dataset with distinct rows from `nhanes` using the `distinct()` function from `dplyr`. Both `unique()` and `distinct()` generally serve the same purpose here.
* `head(data_uniq)` displays the first six rows of the dataset with duplicates removed.
* `dim(data_uniq)` returns the dimensions (number of rows and columns) of the unique dataset.

This helps confirm how many unique rows remain after removing duplicates.
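
One practical difference is that `distinct()` can de-duplicate on a subset of columns. The following sketch uses invented data and the `.keep_all = TRUE` argument to keep the first row per `ID`; it is an illustration, not a step the notebook performs.

```r
library(dplyr)

toy <- tibble(ID = c("1", "1", "1", "2"),
              BP = c(110, 110, 120, 115))

rows_unique   <- unique(toy)    # base R: drops fully identical rows
rows_distinct <- distinct(toy)  # dplyr: same result here

# Keep the first row for each ID, retaining the other columns
first_per_id <- distinct(toy, ID, .keep_all = TRUE)

print(rows_distinct)
print(first_per_id)
```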


In [17]:
# remove duplicate rows using both "unique" and "distinct"
data_uniq <- unique(nhanes)
data_dist <- distinct(nhanes)

head(data_uniq)
dim(data_uniq)

ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113.0,2009_10,male
51625,4,15.3,,2009_10,male
51630,49,30.57,112.0,2009_10,female
51638,9,16.82,86.0,2009_10,male
51646,8,20.64,107.0,2009_10,male
51647,45,27.24,118.0,2009_10,female


This code inspects the `data_dist` dataset, which contains distinct rows from `nhanes`:

* `head(data_dist)` displays the first six rows of the `data_dist` dataset to preview the beginning of the data.
* `tail(data_dist)` displays the last six rows to preview the end of the data.
* `dim(data_dist)` returns the dimensions of `data_dist` (number of rows and columns).

These commands help verify the contents and size of the dataset after removing duplicates.


In [18]:
# Inspect the de-duplicated data created with distinct() (dplyr)
head(data_dist)
tail(data_dist)
dim(data_dist)

ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113.0,2009_10,male
51625,4,15.3,,2009_10,male
51630,49,30.57,112.0,2009_10,female
51638,9,16.82,86.0,2009_10,male
51646,8,20.64,107.0,2009_10,male
51647,45,27.24,118.0,2009_10,female


ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
71907,80,23.2,148.0,2011_12,male
71908,66,35.1,114.0,2011_12,female
71909,28,29.4,124.0,2011_12,male
71910,0,,,2011_12,female
71911,27,31.3,133.0,2011_12,male
71915,60,27.5,147.0,2011_12,male


This code tries out substitutions on the `ID` column in `data_dist` using the `sub()` function:

* `sub('5*', 'subj', data_dist$ID)` replaces the first match of the pattern `'5*'` in each element of the `ID` vector with the string `"subj"`.

  * Note: In regex, `'5*'` means "zero or more 5s," so it always matches at the very start of the string. An ID that begins with 5 has its leading run of 5s replaced (`"51624"` becomes `"subj1624"`), while any other ID matches the empty string and simply gets `"subj"` prepended (`"71907"` becomes `"subj71907"`). To replace only a leading 5, the pattern `'^5'` is the safer choice.
* `sub('subj', '5', data_dist$ID)` looks for `"subj"` in the original `ID` values. Because the result of the first call was never stored, no ID contains `"subj"` and the vector comes back unchanged.
* `tail(data_dist)` shows the last six rows of `data_dist`, confirming that the IDs are untouched.

**Important:**
These `sub()` calls return modified vectors but **do not change the original `data_dist` unless the result is assigned back**, for example to `data_dist$ID`.
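
The effect of the pattern is easiest to see on three hand-picked IDs. This sketch compares `'5*'` with the anchored pattern `'^5'` and with `gsub()`, which replaces every match rather than only the first; the vector `ids` is invented for the example.

```r
ids <- c("51624", "71907", "55012")

star     <- sub("5*", "subj", ids)  # matches a leading run of 5s, or nothing
anchored <- sub("^5", "subj", ids)  # replaces exactly one leading 5
every    <- gsub("5", "X", ids)     # replaces every 5 in the string

print(star)      # "subj1624"  "subj71907" "subj012"
print(anchored)  # "subj1624"  "71907"     "subj5012"
print(every)     # "X1624"     "71907"     "XX012"
print(ids)       # the original vector is unchanged
```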


In [19]:
# Replacing specific string in the whole column with another string
sub('5*','subj', data_dist$ID)
sub('subj','5', data_dist$ID)
tail(data_dist)

ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
71907,80,23.2,148.0,2011_12,male
71908,66,35.1,114.0,2011_12,female
71909,28,29.4,124.0,2011_12,male
71910,0,,,2011_12,female
71911,27,31.3,133.0,2011_12,male
71915,60,27.5,147.0,2011_12,male


### Subsetting the Data by Column Conditions with `filter()`

This code filters the `data_dist` dataset to create a subset and inspects it:

* `filter(data_dist, Age_yr == 34, Gender == 'male')` selects rows where the `Age_yr` column equals 34 **and** the `Gender` column equals `'male'`. This uses the [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) function from dplyr.
* The resulting filtered data is saved as `age_34_data`.
* `age_34_data` prints the filtered dataset to the console or notebook.
* `dim(age_34_data)` returns the dimensions (number of rows and columns) of this subset, showing how many male individuals aged 34 are present.


In [20]:
# Condition -- Find all rows where Age is 34 and gender is 'male'
age_34_data <- filter(data_dist, Age_yr == 34, Gender == 'male')
age_34_data
dim(age_34_data)

ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113,2009_10,male
52789,34,22.31,107,2009_10,male
54123,34,26.96,104,2009_10,male
54148,34,28.69,124,2009_10,male
54564,34,28.7,116,2009_10,male
54636,34,20.45,106,2009_10,male
56089,34,40.68,124,2009_10,male
56624,34,28.59,120,2009_10,male
57024,34,31.06,123,2009_10,male
57184,34,31.23,124,2009_10,male


This code creates a subset of `data_dist` with male individuals whose age falls within a specified range:

* The pipe operator `%>%` passes `data_dist` into the [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) function.
* `filter(between(Age_yr, 34, 40), Gender == 'male')` selects rows where:

  * `Age_yr` is between 34 and 40 (inclusive), using the `between()` function.
  * `Gender` is "male".
* The filtered dataset is saved as `age_range_data`.
* `age_range_data` prints the filtered data to the console or notebook output.
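
A few properties of `filter()` are worth checking on toy data: `between()` includes both endpoints, comma-separated conditions are combined with AND, `|` expresses OR, and rows where a condition evaluates to `NA` are dropped. The tibble below is invented for the purpose.

```r
library(dplyr)

toy <- tibble(Age_yr = c(33, 34, 40, 41, NA),
              Gender = c("male", "male", "female", "male", "male"))

in_range  <- filter(toy, between(Age_yr, 34, 40))        # 34 and 40 are kept
same_rows <- filter(toy, Age_yr >= 34 & Age_yr <= 40)    # equivalent condition
males_in  <- filter(toy, between(Age_yr, 34, 40), Gender == "male")  # AND
either    <- filter(toy, Age_yr > 40 | Gender == "female")           # OR

print(in_range)
print(males_in)
print(either)
```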



In [21]:
# Condition -- find males aged between 34 and 40 (inclusive)
age_range_data <- data_dist %>% filter(between(Age_yr, 34, 40), Gender == 'male')
age_range_data

ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113,2009_10,male
51694,38,35.84,147,2009_10,male
51701,36,25.95,117,2009_10,male
51832,38,25.91,118,2009_10,male
51991,35,28.82,121,2009_10,male
52013,36,28.09,102,2009_10,male
52050,39,32.70,127,2009_10,male
52179,40,35.45,117,2009_10,male
52197,39,40.66,140,2009_10,male
52214,38,26.66,125,2009_10,male


This code filters the `data_dist` dataset to select male individuals older than 50:

* The pipe operator `%>%` passes `data_dist` into `filter()`.
* `filter(Age_yr > 50, Gender == 'male')` selects rows where:

  * `Age_yr` is greater than 50.
  * `Gender` is `'male'`.
* The resulting subset is saved as `age_gt_data`.
* `age_gt_data` prints this filtered dataset to the console or notebook output.


In [22]:
# Condition -- find males older than 50
age_gt_data <- data_dist %>% filter(Age_yr > 50, Gender == 'male')
age_gt_data

ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51654,66,23.67,111,2009_10,male
51656,58,23.69,104,2009_10,male
51657,54,26.03,134,2009_10,male
51678,60,25.84,152,2009_10,male
51692,54,36.32,90,2009_10,male
51748,56,24.98,86,2009_10,male
51803,58,29.27,139,2009_10,male
51804,80,17.87,132,2009_10,male
51819,80,34.04,125,2009_10,male
51861,80,29.34,149,2009_10,male


## Handling Missing Data

This code prints the number of missing (`NA`) values in specific columns of the `data_dist` dataset:

* For each column (`BP`, `BMI`, `Age_yr`, `ID`), it calculates the total count of `NA` values using `sum(is.na(...))`.
* `cat()` is used to combine a descriptive message and the count into a single output.
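
When many columns need checking, a more compact alternative (not used in the notebook) is `colSums(is.na(...))`, which counts missing values for every column in one call. A sketch on invented data:

```r
toy <- data.frame(BP     = c(113, NA, 112),
                  BMI    = c(32.2, 15.3, NA),
                  Age_yr = c(34, 4, 49))

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```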


In [23]:
# Count the number of rows with column values as NA
cat('Number of missing values in column BloodPressure:', sum(is.na(data_dist$BP)), '\n')
cat('Number of missing values in column BMI:', sum(is.na(data_dist$BMI)), '\n')
cat('Number of missing values in column Age:',sum(is.na(data_dist$Age_yr)), '\n')
cat('Number of missing values in column ID:',sum(is.na(data_dist$ID)), '\n')

Number of missing values in column BloodPressure: 1137 
Number of missing values in column BMI: 304 
Number of missing values in column Age: 0 
Number of missing values in column ID: 0 


### Data Imputation - Replacing NA by Some Other Values

This code performs **mean imputation** to fill missing values (`NA`s) in the `data_dist` dataset and then checks the number of remaining missing values:

* `impData_mean <- data_dist` creates a copy of `data_dist` called `impData_mean` to keep the original data intact.
* `impData_mean$BP[is.na(impData_mean$BP)] <- mean(impData_mean$BP, na.rm = TRUE)`:

  * Calculates the mean of the `BP` column, ignoring missing values (`na.rm = TRUE`).
  * Replaces all `NA`s in the `BP` column with this mean value.
* Similarly, missing values in the `BMI` column are replaced with the column’s mean.
* `cat(' The number of NA in imputed dataframe', sum(is.na(impData_mean$BP)))` prints how many `NA`s remain in the imputed `BP` column (should be zero after imputation).
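
The same replacement can be wrapped in a small helper and applied to several columns at once. The function name `impute_mean` and the toy data are assumptions made for this sketch; the logic is the mean imputation described above.

```r
# Hypothetical helper: replace NAs in a numeric vector with its mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy <- data.frame(BP  = c(110, NA, 130),
                  BMI = c(20, 30, NA))

# Apply the helper to each selected column and write the results back
cols <- c("BP", "BMI")
toy[cols] <- lapply(toy[cols], impute_mean)

print(toy)
```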



In [24]:
# Replace NA by column 'mean'
impData_mean <- data_dist    #create a copy of original data
impData_mean$BP[is.na(impData_mean$BP)] <- mean(impData_mean$BP, na.rm = TRUE) # replace one column
impData_mean$BMI[is.na(impData_mean$BMI)] <- mean(impData_mean$BMI, na.rm = TRUE)
cat(' The number of NA in imputed dataframe', sum(is.na(impData_mean$BP)))

 The number of NA in imputed dataframe 0

This code performs **zero imputation** to fill missing values in the `BP` column:

* `impData_zero <- data_dist` creates a copy of the `data_dist` dataset named `impData_zero`.
* `impData_zero$BP[is.na(impData_zero$BP)] <- 0` replaces all missing (`NA`) values in the `BP` column with zero (`0`).
* `head(impData_zero)` displays the first six rows of the dataset after this replacement.


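Zero imputation is a simple method where missing values are replaced with zero. It is reasonable only when zero is a plausible value for the variable; for a measurement such as blood pressure it is not, and the inserted zeros pull the mean and other summaries downward.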
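
A four-value example with invented readings makes the size of this effect concrete by comparing the mean before and after filling with zero.

```r
bp <- c(110, 120, NA, 130)  # invented blood pressure readings

mean_observed <- mean(bp, na.rm = TRUE)  # mean of the three observed values

bp_zero <- bp
bp_zero[is.na(bp_zero)] <- 0             # zero imputation
mean_zero <- mean(bp_zero)               # mean after inserting a zero

print(mean_observed)  # 120
print(mean_zero)      # 90
```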


In [25]:
# Replace NA by some value (say 0)
impData_zero <- data_dist 
impData_zero$BP[is.na(impData_zero$BP)] <- 0
head(impData_zero)

ID,Age_yr,BMI,BP,SurveyYr,Gender
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
51624,34,32.22,113,2009_10,male
51625,4,15.3,0,2009_10,male
51630,49,30.57,112,2009_10,female
51638,9,16.82,86,2009_10,male
51646,8,20.64,107,2009_10,male
51647,45,27.24,118,2009_10,female


This code removes all rows containing any missing (`NA`) values from the dataset and reports the number of missing values before and after:

* `dataNo_na <- data_dist` creates a copy of `data_dist` named `dataNo_na`.
* `cat('Before deleting all NA rows', sum(is.na(dataNo_na)), '\n')` prints the total count of missing values across the entire dataset before cleaning.
* `dataNo_na <- na.omit(dataNo_na)` removes all rows that have **any** missing values in any column.
* `cat('After deleting all NA rows', sum(is.na(dataNo_na)), '\n')` prints the count of missing values after deletion (should be zero).

This is a straightforward way to get a dataset with no missing data by dropping incomplete records.


In [26]:
# Remove all the rows with NA entries
dataNo_na <- data_dist
cat('Before deleting all NA rows', sum(is.na(dataNo_na)), '\n')
dataNo_na <- na.omit(dataNo_na)
cat('After deleting all NA rows', sum(is.na(dataNo_na)), '\n')

Before deleting all NA rows 1441 
After deleting all NA rows 0 


This code removes rows with missing values specifically in the `BP` column and then counts remaining missing values:

* `rmCol_na <- data_dist %>% drop_na(BP)` uses the `drop_na()` function from `tidyr` to remove any rows where the `BP` column has `NA`.
* The pipe `%>%` passes `data_dist` into `drop_na(BP)`.
* `sum(is.na(rmCol_na))` counts **all** missing values remaining in the resulting dataset across **all columns**.

This method selectively removes rows with missing `BP` but keeps rows with missing values in other columns.
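
The difference between dropping on all columns and dropping on one column can be seen on a three-row toy data frame (values invented for illustration):

```r
library(tidyr)

toy <- data.frame(BP  = c(113, NA, 112),
                  BMI = c(32.2, 15.3, NA))

complete_rows <- na.omit(toy)      # drops rows with NA in any column
bp_present    <- drop_na(toy, BP)  # drops rows only where BP is NA

print(nrow(complete_rows))         # 1
print(nrow(bp_present))            # 2
print(sum(is.na(bp_present)))      # 1 (the missing BMI is still there)
```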


In [27]:
# Remove rows with NA in a specific column (BP)
rmCol_na <- data_dist %>% drop_na(BP)
sum(is.na(rmCol_na))


### Using `apply` Functions For Data With No NA Values

This code uses the `apply()` function to perform operations across columns of the `dataNo_na` dataset:

* `apply(dataNo_na, 2, min)`:

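  * Applies the `min` function to each column (a `MARGIN` of `1` means rows, `2` means columns).
  * Returns the minimum value found in each column.
  * `apply()` first converts its input to a matrix. Because `dataNo_na` contains character columns, the whole dataset becomes a character matrix, so every minimum, including those of the numeric columns, is returned as a string determined by string comparison.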
* `apply(dataNo_na, 2, length)`:

  * Applies the `length` function to each column.
  * Returns the total number of elements (rows) in each column, which should be the same for all columns in a well-formed dataset.

**Note:**
If your dataset has mixed types (numeric, character, factor), it's often safer to apply such functions to only numeric columns, for example:

```r
apply(dataNo_na[, sapply(dataNo_na, is.numeric)], 2, min)
```



In [28]:
# Find min, length of ALL the columns
apply(dataNo_na, 2, min)   # MARGIN = 1 (for row), 2 (for column)
apply(dataNo_na, 2, length) # find the length of each column
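
The coercion behaviour of `apply()` on mixed-type data can be checked on a small example. This is a sketch only; the `toy` data frame and its values are assumptions, not the notebook's data.

```r
# Hypothetical toy data with one character and one numeric column
toy <- data.frame(
  Gender = c("female", "male", "male"),
  Age_yr = c(9, 10, 41),
  stringsAsFactors = FALSE
)

# apply() converts the data frame to a matrix first; because a character
# column is present, every value becomes a string
mins_all <- apply(toy, 2, min)
class(mins_all)   # character, even for Age_yr

# Restricting to numeric columns keeps the comparison numeric
num_only <- toy[, sapply(toy, is.numeric), drop = FALSE]
mins_num <- sapply(num_only, min)
mins_num
```

`drop = FALSE` keeps `num_only` a data frame even when only one numeric column is selected.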

This code calculates the mean for selected numeric columns using both `lapply()` and `sapply()`:

* `num_cols <- list(dataNo_na$Age_yr, dataNo_na$BMI, dataNo_na$BP)`:

  * Creates a list containing three numeric vectors from the `dataNo_na` dataset: `Age_yr`, `BMI`, and `BP`.
* `lapply(num_cols, mean)`:

  * Applies the `mean` function to each element (i.e., each column) of the list.
  * Returns a list of mean values.
* `sapply(num_cols, mean)`:

  * Similar to `lapply()`, but simplifies the result to a vector if possible.
  * In this case, it returns a numeric vector containing the mean of each column.

Both approaches compute the mean for each selected column, but `sapply()` provides a cleaner, simplified output.

In [29]:
# lapply() and sapply(): apply a function over a list or vector
num_cols <- list(dataNo_na$Age_yr, dataNo_na$BMI, dataNo_na$BP)
lapply(num_cols, mean)  # output as list
sapply(num_cols, mean) # output as vector
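
Because the list above is built without names, the results are unlabeled. One possible variation, sketched on made-up data (`toy` is hypothetical), selects the columns by name so the output carries the column names, and uses `vapply()` to state the expected result type explicitly:

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  Age_yr = c(25, 40, 31),
  BMI    = c(22.5, 27.0, 30.5),
  BP     = c(118, 125, 111)
)

# A data frame is a list of columns, so selecting by name keeps the names
num_cols <- toy[c("Age_yr", "BMI", "BP")]

means_list <- lapply(num_cols, mean)   # named list
means_vec  <- sapply(num_cols, mean)   # named numeric vector

# vapply() works like sapply() but errors unless each result is one number
means_checked <- vapply(num_cols, mean, numeric(1))
means_checked
```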

## Using `group_by` For Grouping Data By Columns

This code calculates summary statistics (means) for different gender groups in the dataset:

* `dataGrp.gender <- dataNo_na %>% group_by(Gender) %>% summarise(...)`:

  * Groups the `dataNo_na` dataset by the `Gender` column using `group_by()`.
  * Calculates the mean for `Age_yr`, `BP`, and `BMI` within each gender group using `summarise()`.
  * `.groups = 'drop'` ensures that the result is returned as a regular (ungrouped) data frame.

* The results are saved to `dataGrp.gender`, which contains the average `Age_yr`, `BP`, and `BMI` for each gender.

* `print('The mean for age, BP, and BMI by gender is:')` prints a descriptive message.

* `print(dataGrp.gender)` displays the summarized table with means by gender.


In [30]:
# By one column - 'Gender'
dataGrp.gender <- dataNo_na %>% group_by(Gender)  %>%
  summarise(mean_age = mean(Age_yr), 
            mean_BP = mean(BP), 
            mean_BMI = mean(BMI), 
            .groups = 'drop')
print('The mean for age, BP, and BMI by gender is:')
print(dataGrp.gender)

[1] "The mean for age, BP, and BMI by gender is:"
# A tibble: 2 x 4
  Gender mean_age mean_BP mean_BMI
  <chr>     <dbl>   <dbl>    <dbl>
1 female     41.1    116.     27.9
2 male       39.5    120.     27.3


This code calculates summary statistics (means) grouped by both gender and survey year:

* `dataGrp.multi <- dataNo_na %>% group_by(Gender, SurveyYr) %>% summarise(...)`:

  * Groups the `dataNo_na` dataset by both `Gender` and `SurveyYr` using `group_by()`.
  * Within each group (combination of gender and year), calculates:

    * `mean_age`: Mean of `Age_yr`.
    * `mean_BP`: Mean of `BP`.
    * `mean_BMI`: Mean of `BMI`.
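  * Note: `.groups` is not set here, so `summarise()` drops only the last grouping level (`SurveyYr`) and the result stays grouped by `Gender`. Specify `.groups = 'drop'` to return an ungrouped data frame.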

* `print('The mean for age, BP, and BMI by gender and year is:')` prints a descriptive message.

* `print(dataGrp.multi)` displays the summarized table with grouped means by gender and survey year.


In [31]:
# By multiple columns - 'Gender' and 'SurveyYr'
dataGrp.multi <- dataNo_na %>% group_by(Gender, SurveyYr)  %>%
  summarise(mean_age = mean(Age_yr), 
            mean_BP = mean(BP), 
            mean_BMI = mean(BMI))
print('The mean for age, BP, and BMI by gender and year is:')
print(dataGrp.multi)


`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.


[1] "The mean for age, BP, and BMI by gender and year is:"
# A tibble: 4 x 5
# Groups:   Gender [2]
  Gender SurveyYr mean_age mean_BP mean_BMI
  <chr>  <chr>       <dbl>   <dbl>    <dbl>
1 female 2009_10      41.4    116.     28.1
2 female 2011_12      40.6    117.     27.6
3 male   2009_10      39.7    120.     27.5
4 male   2011_12      39.4    120.     27.0


This code calculates additional summary statistics grouped by both gender and survey year:

* Groups data by both `Gender` and `SurveyYr`.
* Calculates:

  * `min_age`: Minimum age in each group.
  * `max_age`: Maximum age in each group.
  * `sum_BP`: Sum of blood pressure values in each group.
  * `mean_BMI`: Mean BMI for each group.
  * `count`: Number of observations (rows) per group.
* `.groups = 'drop'` ensures the result is returned as an ungrouped data frame.


In [32]:
# similarly we can use min, max, sum, count function to aggregate
dataGrp.summary <- dataNo_na %>%
  group_by(Gender, SurveyYr) %>%
  summarise(
    min_age = min(Age_yr),
    max_age = max(Age_yr),
    sum_BP = sum(BP),
    mean_BMI = mean(BMI),
    count = n(),   # Count of rows in each group
    .groups = 'drop'
  )

print('Summary statistics by gender and year (min, max, sum, count):')
print(dataGrp.summary)


[1] "Summary statistics by gender and year (min, max, sum, count):"
# A tibble: 4 x 7
  Gender SurveyYr min_age max_age sum_BP mean_BMI count
  <chr>  <chr>      <dbl>   <dbl>  <dbl>    <dbl> <int>
1 female 2009_10        8      80 172971     28.1  1497
2 female 2011_12        8      80 154983     27.6  1328
3 male   2009_10        8      80 174366     27.5  1458
4 male   2011_12        8      80 158054     27.0  1315


## Acknowledgements

The R code used in this notebook was originally written by Vandana Srivastava, AI/Data Science Specialist, University Libraries, USC.

Explanatory text, annotations, and additional instructional content were added by Meara Cox, USC 2026 Graduate.